Several interesting random thoughts from JSM:
A session by Freeda Cooner entitled “Bayesian statistics are powerful, not magical” discussed ways in which Bayes results could be slanted (one hopes unwittingly). One point worth repeating is that the validity of the results assumes accurate priors. Kind of obvious, no? Yet, the question is, where did you get your prior probabilities? Did you base them on studies of use of this drug with adults when your current study is of children? Did you base them on studies of a “similar” drug when this is a study of a new drug?
As I said, when you think about this point, it is kind of obvious but I suspect people don’t think about it often enough.
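To make the point about priors concrete, here is a minimal sketch in Python. It is not anything from Cooner's session. It assumes a simple Beta prior on a response rate, and every number in it (12 of 40 children responding, an adult prior worth 100 patients at a 60% response) is made up for illustration.

```python
def beta_posterior_mean(prior_a, prior_b, successes, failures):
    """Posterior mean of a rate under a Beta(prior_a, prior_b) prior
    after observing binomial data (the standard conjugate update)."""
    post_a = prior_a + successes
    post_b = prior_b + failures
    return post_a / (post_a + post_b)

# Hypothetical data: 12 of 40 children respond to the drug.
successes, failures = 12, 28

# Hypothetical priors.
flat = (1, 1)              # no prior information
adult_studies = (60, 40)   # as if adult studies had shown 60 responders in 100 patients

mean_flat = beta_posterior_mean(*flat, successes, failures)
mean_adult = beta_posterior_mean(*adult_studies, successes, failures)

print(f"flat prior:  {mean_flat:.3f}")
print(f"adult prior: {mean_adult:.3f}")
```

Same data, two priors, and the estimated response rate moves from about 0.31 to about 0.51. If the adult studies do not really apply to children, that second number is mostly the prior talking.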
A second interesting point was made by Milo Schield about “causal heterogeneity”. That is, when we are testing a new treatment, we like to think that those who live do so because of the treatment (saved) and those who die do so as a result of the failure of the treatment (killed). That is, we act as if there are only two categories. In reality, he says, there are four groups. In addition to the “saved” and “killed” groups there are those who would have lived regardless (“immune”) and those who would have died regardless (“doomed”).
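One way to picture the four groups is to imagine knowing, for each patient, what would happen both with and without the treatment. The sketch below does that in Python. It is my own reading of the idea, not Schield's code: in particular, it assumes “killed” means someone who dies with the treatment but would have lived without it, and the patient counts are invented.

```python
from collections import Counter

def causal_type(lives_if_treated, lives_if_untreated):
    """Classify a patient by both potential outcomes (an assumed reading of the four groups)."""
    if lives_if_treated and lives_if_untreated:
        return "immune"   # would have lived regardless
    if not lives_if_treated and not lives_if_untreated:
        return "doomed"   # would have died regardless
    if lives_if_treated:
        return "saved"    # lives only because of the treatment
    return "killed"       # dies with the treatment, would have lived without it

# Hypothetical treated patients as (lives_if_treated, lives_if_untreated).
treated = (
    [(True, True)] * 50
    + [(True, False)] * 20
    + [(False, True)] * 5
    + [(False, False)] * 25
)

types = Counter(causal_type(t, u) for t, u in treated)
survivors = sum(1 for t, _ in treated if t)

print(types)
print(f"{survivors} survivors, of whom {types['saved']} were actually saved")
```

In real data you only ever see one of the two outcomes per patient, so the 70 survivors here would all look alike even though only 20 of them owe anything to the treatment.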
Another point by Schield was that although we always say that correlation does not mean causation we almost always give examples of confounding variables. We say, for example, that although ice cream sales go up along with violent crime, eating ice cream doesn’t cause you to go after your neighbor with a baseball bat, unless perhaps your neighbor is spied eating ice cream off of your spouse. However, when we look at probability in terms of p-values we are really spending most of our introductory statistics courses testing whether or not observed relationships are a coincidence and we should emphasize guarding against coincidence more than confounding.
Personally, I think I do talk about this a lot, so if you do not, feel shame.
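If you want to show students what “guarding against coincidence” means without any formulas, a permutation test does it nicely. This is a minimal sketch with made-up scores, not an example from Schield's talk: it shuffles the group labels many times and counts how often chance alone produces a difference in means as big as the one observed.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Share of random relabelings whose absolute difference in means is at
    least as large as the observed one (a two-sided permutation test)."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    return hits / n_shuffles

# Made-up scores for two small groups.
treatment = [12, 15, 14, 16, 13, 17]
control = [11, 12, 10, 13, 12, 11]

p = permutation_p_value(treatment, control)
print(f"p = {p:.4f}")
```

A small p here says the difference is unlikely to be a coincidence. It says nothing at all about confounding, which is exactly the distinction being made above.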
Another really interesting idea came from Chris Fonnesbeck. He was discussing putting his code up on GitHub and I thought,
“Why don’t I do that? Why don’t other people in our company?”
I hate to admit that we had just never taken the time to do it, for which I now feel guilt because I do look for code on GitHub occasionally, and, more often, just browse it looking for interesting ideas, or for the hell of it.
Seriously, how weird must you be to fall for this shit?
Normally spam is just annoying but today I got one that I cannot imagine anyone could be stupid enough to fall for. It was from Woody Allen, who apparently is a big yahoo fan because his email is email@example.com and the reply to is firstname.lastname@example.org . I guess it is not THAT Woody Allen because the real thing could probably afford email@example.com - although having the domain name woody.com could possibly set one up for a whole different type of spam email.
This Woody Allen apparently works at NCR Corporation. I am sure National Cash Registers is really interested in statistical consulting and has a lot of SAS and SPSS macros they want me to write for them. As they say, they “offer great interest to do a purchase agreement”. The fact that they can’t work a function to pull my name from a list or figure that the number 4 comes between 3 and 5 would never make me wonder if this was a real email.
From: Mr.Woody Allen <firstname.lastname@example.org>
Subject: Please Quote.
Date: July 27, 2012 10:01:17 PM PDT
Having gone through your listed products, we offer great interest to do a purchase agreement with your company. Please get back to us with the following information
1. Prices FOB
2. Payment terms
3. Delivery Period
5. Specified delivery date assuming from the Date of Order.
Please your quick reply will be highly appreciated
Mr. Woody Allen
Purchasing Operations Manager.
Just as the lottery is a tax on people who are bad at math and cocaine is God’s way of telling you that you’re making too much money, I believe spam like this is a penalty on stupid people who use the internet.
I’m in North Carolina this week at a class for professors on Advanced Predictive Modeling using SAS Enterprise Miner. This is the sort of statement that causes The Spoiled One & Co. to make faces that look like this:
Despite my failure to impress fourteen-year-olds, I think the class has been well worth it.
I’m not naive, I realize that taking a class on a SAS product by a SAS instructor at the SAS facility represents the best case scenario. You know those IT people you always want to smack who say,
Well, I don’t know what’s wrong. It works on my computer.
Yeah, because your computer has maxed out RAM, the latest software and is in the same building as the server. Given that, if it doesn’t work on the IT computer, you’re totally hosed.
So, if someone else has installed it for you, you have it running on a powerful computer with the latest version (7.1) and NOT the scaled down SAS On-Demand version and it doesn’t work, I would say Enterprise Miner is hopeless.
It’s not hopeless. It worked fine. All the problems I had were due to my own stupidity, like using a factor of .068 instead of 0.68 or forgetting to type SUM when writing a sum function and stuff like that.
On the other hand, those are the kind of problems that are quick and easy to find and fix.
There is a lot to like about SAS Enterprise Miner. Take this nifty little example from VARIABLE CLUSTERING
Think of it as a graph of a principal components analysis.
You can see which variables load together very easily. Imagine explaining this to a client over a rotated factor pattern.
The two best things about Enterprise Miner are diagrams like the one above and the enormous number of data mining procedures it offers.
The two worst things about Enterprise Miner are related to one of the best. Because it can do so much, learning it is quite complicated. I had thought I might use it in my class this fall but it is clearly going to be too much on top of what I have already planned. I am sticking with SAS Enterprise Guide. I am still toying with the idea of teaching a business school class in the spring using data mining, so if I do that Enterprise Miner is a possibility.
The worst thing about EM used to be the installation. You had to have 1,024 GB of RAM, sacrifice a flamingo and get your registry updated by seven of the twelve apostles before it would install. I just today met the first person who told me he had the damn thing installed – and he now works at SAS, which is probably why they hired him.
Amazingly, at least for the SAS On-Demand version, that seems to be a non-issue. I was in a hotel with a crummy wireless, using my old laptop that has 32-bit Windows 7 installed on bootcamp on a MacBook Pro. Enterprise Miner 7.1 started up and ran fine once I had the latest version of Java (and only one version of Java) installed. This incidentally isn’t the Java version they say is guaranteed to work, but it did.
Besides the complication of learning and installation, I think the other big drawback of EM if you are not at a university or college (for which it is free) is that it costs about a scadzillion dollars (that’s a zillion squared plus a bushel, 40 leagues and a peck, for you Europeans reading this).
The SAS On-Demand offerings are a good first step, but I think SAS is missing an opportunity to market to people and companies who don’t have more money than a Romney Super-PAC.
So that’s the good and the bad in a nutshell. The neutral is that more SAS programming was involved than I had expected. This did not bother me in the slightest but it might perturb you if you are used to SAS Enterprise Guide where you can go pretty much forever without programming (whether you should or not is a different issue).
The class was fun, so fun that I am now seriously wondering if I can clear my schedule enough to teach a class in the spring just so that I can inflict it on some unsuspecting graduate students. I think that might be fun, too.
Tried reading in a file with 360,000+ records times 279 variables with SAS On-Demand using SAS Enterprise Guide. It was on one of my office computers that has a pretty slow Internet connection. I was using a different computer at the time so I just let it run. After 29 minutes, I gave up, did it on SAS Enterprise Guide installed on a laptop and it was done in a minute or less.
Lesson learned: Don’t try running huge files on SAS On-Demand
The positive side is once I reduced that file to the subset of 8,000 records and 20 variables that I needed, everything ran perfectly on SAS On-Demand.
Really, that’s not unreasonable performance. If you are doing a project for a class, you probably are not going to have an enormous data set in either length or width. For anything of reasonable size – and 8,000 records is pretty good size for a class project – SAS On-Demand works pretty well.
However, I have never been accused of being a reasonable person and the first thing I think when I get any new piece of software is to try to break it. As my grandmother said, this is why you cannot have nice things.
I’m in North Carolina the next few days for a class on predictive modeling with SAS Enterprise Miner. I thought I would brush up on it a bit before class, so I tried to install it on boot camp on my Mac desktop.
First problem occurred when I tried to update Java on my desktop. Instead of updating the version of Java I had, somehow I ended up with two different versions of Java installed. When I tried to get SAS On-Demand with Enterprise Miner to work by clicking on the link from the SAS On-Demand page, nothing.
I thought perhaps I should not have two versions of Java, so I uninstalled the newer version I had just installed and tried again. This time I got a small file downloaded named MAIN. I was expecting that. I double-clicked on it and a window popped up that said Java Web Start and said it was downloading and verifying the application Enterprise Miner 7.1.
This went on for about 15 minutes and Enterprise Miner never started up. I cancelled it, deleted the other version of Java, and downloaded the very latest version and tried again. Same thing happened. Actually I did not expect the version of Java was the problem but I figured I may as well update to the latest shiny thing while I was at it.
I am pretty certain that the problem is the internet connection is pathetically slow. It seriously reminds me of the days when the only way I could get an internet connection in North Dakota was through dialing up (anyone remember dial-up?) on CompuServe or AOL. I’m not exaggerating.
So … I moved to a different spot in the hotel where (according to Windows 7) my connection strength is now “excellent”. Bingo! Enterprise Miner.
It’s still kind of slow, but it is working.
I’ll be at a different hotel tomorrow (my life story) and try the connection there and again when I get back home. Microsoft and I have different ideas of the speed that should be identified as “excellent”. Apparently, Enterprise Miner agrees with me.
You might think this would deter me from using SAS On-Demand in my course. It may not, though. The course I am considering using it for is an on-line course and I would be doing the demos from my desktop that has an ethernet connection. I THINK the speed will be okay.
It may be a little slow when the students use it on their own home networks but they are getting a mega-expensive piece of software for free, so they need to be a little reasonable.
(I am a big hypocrite here because I have the patience of an infant.)
Really, though, when you finally get it working, SAS Enterprise Miner is pretty cool.
These days, about half of my time is spent in management and collaboration, a fourth on writing code for a game and the remaining fourth on statistical analysis. This isn’t how I’d choose to divide my hours if I was in charge of the universe, but, that’s the way it is.
That is the mad scientist laughing sound I want to make whenever I hear someone cheerfully say,
“With the Internet and iTunes, the price of entry is so low, anyone can make a computer game or app and make a million dollars.”
Let’s start with that price of entry low thing. If you want to make a Bejeweled imitation then maybe the price of entry is relatively low. Let’s say you want to make some kind of educational game. You need:
- A story – I’ll presume you’re a writer and have a great idea, since everyone I talk to assumes they are a writer and has a great idea.
- Artwork – drawings for everything from the splash screen to your characters to your scenes. Generally, great games have great art.
- Animation – either 2-D or 3-D.
- A game “world” – for some games this can be very simple, like a race track that your cars race down to the end of the game. This is different than animation itself. The world has to have rules. You can’t fly up off the race track. At some point the track ends, it doesn’t go into infinity. There are usually points and ways of getting points. All of that has to be programmed. The more complex the world, the more programming.
- A means of storing data such as usernames and game state so that if I am playing your game and come back to it later, I start in the same place I was before, while Joe Schmoe starts in the same place he was before. A game is not REQUIRED to have a database and many don’t but if you are going to have a game over the Internet that is not really simple, a database is probably necessary. Also, you need to communicate between the game and the database.
- Sounds. When a car crashes or an alien is shot, there should probably be a sound. You might also want a musical sound track that plays at different points. You have to get these sounds from somewhere. Then you need to add them in.
- Some knowledge of the content area for education. Say you are developing a science game for kids in grades three through six. What are kids supposed to be learning in science in those grades? What words are at a third-grade reading level? How will you route kids to the appropriate level? What prerequisite information can you assume they will know?
- Students and teachers to test your game. This is always a humbling experience as you find the parts you thought were so engaging are simply confusing. Someone needs to locate students and teachers at the right grade level, arrange for them to try your game and see what works and what doesn’t.
- Documentation. Surely you are going to write down somewhere the location of the databases, variable names, languages used, what the statements do. Some of that is in the code but not all of it.
- Project management. Unless you are amazingly talented, I will bet you are not a gifted artist, programmer, writer and also happen to have a background in K-12 education. So, you need to break down everything in items 1-9 (and I’m sure some other parts I forgot about) into details of exactly what the artist needs to do to fit with what the programmer is doing and what the programmer needs to do to make sure the educational objectives are met.
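Of the items in that list, the data storage one is the easiest to show in a few lines. What follows is only a sketch of the idea, using SQLite from Python's standard library. It is not how our game does it, and the table name, column names and the default starting state of level 1 with 0 points are all invented for the example.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a tiny game-state database. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS game_state ("
        " username TEXT PRIMARY KEY,"
        " level INTEGER NOT NULL,"
        " points INTEGER NOT NULL)"
    )
    return conn

def save_state(conn, username, level, points):
    """Insert or overwrite one player's saved position."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO game_state (username, level, points)"
            " VALUES (?, ?, ?)",
            (username, level, points),
        )

def load_state(conn, username):
    """Return (level, points) for a player, or a fresh start if they are new."""
    row = conn.execute(
        "SELECT level, points FROM game_state WHERE username = ?", (username,)
    ).fetchone()
    return row if row is not None else (1, 0)

conn = open_store()
save_state(conn, "me", 3, 450)
save_state(conn, "joe_schmoe", 1, 20)
print(load_state(conn, "me"))          # my saved position
print(load_state(conn, "joe_schmoe"))  # Joe Schmoe's saved position
```

That is the easy part. A real game over the Internet still needs the communication between the game and the database, plus logins and everything else in the list.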
So, yeah, every time I hear someone toss off, “You should make an app for that” like it is no tougher than sketching something out on a napkin, I want to slap them upside the head.
The Rocket Scientist and I laugh a lot over margaritas at people who think because they have an idea for something they should be entitled to 30-50% of the profits. As some brilliant person on Twitter said,
“Imagine the most beautiful scene you’ve ever witnessed. Now imagine a painting of that scene. The difference between those two is why your idea is only worth 1% of the value of the actual product.”
This is also why, when people ask me if I am worried that someone else will steal our idea, I laugh. The idea is not what is worth the money. It is the design, artwork, animation, programming, soundtrack and the integration of all of those. Maybe $100,000 of our own money doesn’t sound like a lot to invest, but we’re a small company, so it is a lot to us. It was definitely a risk, although it gets to be less of a risk each passing day as we get it done bit by bit.
All of that said, if I knew when I started this project what I know now, would I still have done it? Yes, absolutely! One of the biggest advantages and one that made it worth every dime we have put into this is that the Rocket Scientist and I have both learned a ton. We have worked with tools that had interested us for years but that we never had a chance to use as much as we liked, and we have designed and written games kids can play using those tools. Unbelievably, I’m even getting to like SQL a little bit.
At this point, I would say we have a 50% chance of the game really working and making a lot of money and 100% chance of us having learned a lot and had a great time. I guess this explains why we funded it ourselves rather than giving up a share of the company to outside investors. I was at an angel investors meeting a while back and I asked a gentleman what were the most fun companies he had invested in. He looked puzzled and answered,
Fun? I would say the most fun companies would be the ones that make me the most money.
It reminded me of something my business partner had said years ago to one of our employees in a company I co-founded previously,
“Everything we do makes money but we don’t do just anything that makes money.”
That’s my take away message on game design and programming. It is not a way to get rich without working. It’s a way to maybe get richer with a whole hell of a lot of work. And it’s fun.
More than once, I have said that I would never hire someone right out of graduate school. My reason is that graduate students come expecting perfect data sets to analyze, with no data entry errors, normal distributions of all variables, no missing data – and a liberal sprinkling of fairy dust to make it all perfect if it is not. The last new, young Ph.D. I hired complained bitterly about his inability to complete – nay, even start – the analysis I had asked him to do. He fumed,
“There are all kinds of problems with these data. There are 50 different ways people spelled Turtle Mountain Chippewa – TM Chippewa, TMT, TMBCI — did you know Ojibwe is another word for Chippewa? And it is sometimes spelled Ojibwa?”
I nodded. Yes, in fact, I did know all of that.
“I can’t do anything with this study! SOMEBODY should do something about it! Someone needs to fix this!”
I told him unusually politely (for me),
“Why, you’re exactly right. That would be part of why we hired YOU.”
We were fresh out of fairy dust.
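For the record, “doing something about it” usually starts with something as unglamorous as a lookup table. Here is a minimal sketch in Python, not the code we actually used. The list of variants is hypothetical beyond the spellings quoted above, and anything not on the list is deliberately left alone so a person can look at it.

```python
import re

CANONICAL = "Turtle Mountain Chippewa"

# Hypothetical list, built by hand from spellings seen in the data.
VARIANTS = {
    "turtle mountain chippewa",
    "tm chippewa",
    "tmt",
    "tmbci",
    "turtle mountain ojibwe",
    "turtle mountain ojibwa",
}

def normalize_key(raw):
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", raw.lower())
    return " ".join(cleaned.split())

def clean_tribe(raw):
    """Return the canonical label for known variants; otherwise return the
    input unchanged so unrecognized spellings can be reviewed by hand."""
    return CANONICAL if normalize_key(raw) in VARIANTS else raw

entries = ["TM Chippewa", "tmbci", " Turtle  Mountain Ojibwa ", "TMT", "Spirit Lake"]
cleaned = [clean_tribe(e) for e in entries]
print(cleaned)
```

It will not catch spelling number 50, but it turns “I can’t do anything with this study” into a short list of leftovers to check.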
A few years ago, I acknowledged the fact that I was being a complete hypocrite because I taught graduate students and so it was partly my fault if they graduated with no experience with real data and unable to program their way out of a paper bag. (Why do we use that expression anyway? Who is so stupid to be stuck in a paper bag and if they were, how would programming help them?)
As I said yesterday, I cannot believe I did not mention using the SASHELP data before. Even though it sounds hypocritical (again) to admit that I start teaching with a perfect little data set from the SASHELP file shipped with SAS Enterprise Guide (including SAS On-Demand), if you were paying attention yesterday you would have learned a fine lesson that while it is great to teach your students two or three things you should try to teach them only one AT A TIME.
So … we start with SAS Enterprise Guide, open up the Server List by
picking a data set. I picked the HEART data set.
What your students see now is the view below. Note the Server list in the bottom left pane.
There are no clerical errors where someone entered 999 instead of 99 for a patient age, and missing data occur only here and there for illustrative purposes. In short, the kind of data set you rarely encounter in real life unless you have put in a lot of work cleaning it. This makes these data sets perfect for use the first day of class when you are just introducing students to the software and you haven’t had time to get a data set of your own prepared for analysis.
The second part of that sentence is not to be overlooked. I taught full-time for many years and there was NEVER enough time to get prepared at the beginning of the semester. These days, I have the luxury of not taking any more work than I feel I could do excellently (as opposed to when I was younger, poorer and happy to be able to do passably decent work as long as I got a paycheck!) STILL … there are crunch times and it is great to have the option of a well-behaved data set ready to go.
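For contrast with the well-behaved SASHELP data, this is the kind of check real data needs before you trust it, using the 999-for-99 clerical error mentioned above. It is a minimal sketch in Python with made-up ages, and the plausible range of 0 to 120 is my assumption, not a rule from any data set.

```python
def flag_implausible_ages(ages, low=0, high=120):
    """Return (index, value) pairs for ages that are missing or outside [low, high]."""
    return [
        (i, age)
        for i, age in enumerate(ages)
        if age is None or not (low <= age <= high)
    ]

# Made-up patient ages with one clerical error and one missing value.
ages = [45, 62, 999, None, 99, 38]
flagged = flag_implausible_ages(ages)
print(flagged)
```

Note that the 99 passes. A range check finds the impossible values, not the merely wrong ones.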
Because these data sets were prepared with teaching statistics in mind you should find almost everything you need – dichotomous variables, such as the dead vs alive in the first column above, normally distributed data you can graph for examples, like the age at which coronary heart disease was diagnosed, below.
Produced as part of the CHARACTERIZE DATA task, in case you were wondering,
( TASKS > DESCRIBE > CHARACTERIZE DATA )
So … you can start with some pointing and clicking on pristine data sets to teach your students about statistics, then move into doing the same with real data. I’ve written a lot about open source data here in the past, so let me just say that I am a fan.
Next, you can move to having your students review the code and ease into programming before they know it.
Also, I should mention that I have also taught with both SPSS and Stata and both of those also come with example data sets, although those Stata provides are a bit sparser than the other two.
I ignored my better instincts which told me loud and clear that I should not be writing any papers for conferences because I am BUSY, not busy in the “I am president of the PTA and need to alphabetize my spice shelf”- busy, but busy in the sense of clients who need things done for which they have already contracted to pay me money. (No disrespect to the PTA president but if you fail to show up for meetings your bills will still get paid next year.)
I ignored my past experience which, when I said,
“Oh, I’ll just present the same paper at WUSS that I did at the SAS Global Forum”
shot right back,
“You KNOW you won’t! You never do that! You always say, but it would be better if I did it like this ….”
So, here I am… writing a Hands On Workshop for the Western Users of SAS Software conference. Here’s why ….
SAS offers many benefits for academic instruction. Students can learn programming logic, to read and write in a programming language, how to interpret statistics, data visualization or data mining. Two major barriers to learning statistics using SAS are the difficulty and the expense. For many students, the need to learn both statistics and a programming language at the same time presents a daunting challenge.
I too often hear from people who scoff at that last remark and argue that students are just lazy, or not bright enough to “make it”.
That attitude reflects a combination of arrogance and ignorance.
I am making the assumption – and I am right – that people who make such comments know both statistics and programming, although in further conversation it almost always turns out that they did NOT, in fact, learn both simultaneously. However, they insist, they COULD have. This is the arrogance part.
The ignorance comes from not having taught statistics. Like almost all of life, it’s harder than it looks. (The exceptions, I have found, are sex and swimming in the ocean. Both of those are easier than they look.)
Many years ago, my friend, Dr. Nina Parker, allowed her microbiology students to re-take an exam that the majority of them failed because, she said,
“It must have been my teaching because not that many people can be that stupid.”
Also many years ago, I read an article on teaching mathematics that began:
The first rule of teaching is to have something to say.
The second rule of teaching is that when you by happy chance have two things to say, to say first one and then the other and do NOT, for the love of God, try to say both things at the same time.
So, yes, my whole point here is that it is easier to learn statistics while pointing and clicking BUT the nice thing about SAS Enterprise Guide is that it gives you the code that was created in a code window, so you can be learning programming at the same time. Wait – what? Isn’t that inconsistent with what I just said? Kind of yes, kind of no. The thing is, your ability to do the statistics correctly does not hinge on your ability to get the program to run.
You can select TASKS > REGRESSION > LOGISTIC and get the logistic regression results. After that, you can also look at the code and see how it was produced. Yes, you can do that with SPSS and Stata also.
A second barrier for many who want to learn is the cost of a SAS license. What to do? Try SAS Enterprise Guide with SAS On-Demand, available free (for higher education) or very low cost, although the SAS On-Demand for Professionals is a limited license. This is a great thing, in my opinion.
I’m not one of those who have almost a religious aversion to paying for software. I just bought a copy of Dreamweaver, and not long ago paid for Webstorm. Both of those saved me a bunch of time. And no, I don’t get diddly squat from any company I mention on here. Not even a crummy 5% discount coupon.
Still, the commercial license of SAS, which is approximately one kidney, your first born and 11 zillion dollars, is one thing if you have a company with 12,000 users (that’s only like a few hundred kidney cells each), but it can be prohibitive for junior professionals just starting out. This puts new people in a bind. On the one hand, they want to learn this expensive software so they can be one of those 12,000 users making a decent salary, but on the other, they’d need a hefty salary to be able to afford a license to play with.
So, it is worth checking out. You are welcome to come to my session at the Western Users of SAS Software conference. If you haven’t done it yet, you have two more days to apply for the Junior Professionals Award and a few weeks yet for the student award. Get on it! Both provide a FREE conference registration.
Here is an example of the type of stuff that will be discussed on how to learn statistics and make it interesting (really!)
When I was looking for an example for this blog, I was shocked to see I had not written about the SASHELP directory in the past and how to use that for learning statistics.
Speaking of COMPLETELY random … expect more blogging from me because my late night Chardonnay writing spot is now available. I co-authored a book on matwork for judo, grappling and mixed martial arts and just got the manuscript off to the publisher on Monday – woo-hoo. (And you thought all jocks were dumb.)
My father passed away about 18 months ago. We hadn’t spoken much over the past 40 years, for reasons that aren’t at all important any more. My mother is living alone now for the first time in 57 years, and really did not need two houses, one in Illinois and one in Florida, so she is moving everything out of the house she lived in since I was eight years old. All of that is to say I have had more contact in the past year with the people and places I knew as a child than I had in the past four decades combined.
It’s been – interesting. The picture above is taken at Peggy’s Cove where I used to go skip school, before they built all the tourist stuff there. My friends and I would put our clothes on the rocks, swim naked in the cove (yes, it was cold), and then dry out in the sun before getting dressed again and heading home. Obviously, if our clothes were wet my aunt (who I lived with then) would know I hadn’t gotten that way at school.
Later, we went to the Citadel, and I told my husband this ammunition room was one of my favorite hiding places when my friends and I would skip school, slip by the soldiers (they’re not too on guard against the yanks attacking any more) and play hide and seek throughout the old fort. He said,
“My God, did you ever go to school?”
I was hardly a promising child. The fact that I made it to adulthood – period – much less, into my fifties, was grounds for mild astonishment to many people. That I did it without a single felony on my record was somewhat more surprising, and that beyond that I actually got a Ph.D., won a world championship, founded a couple of companies and raised what appear to be relatively normal children was a source of continual amazement. Not that the family wasn’t pleasantly surprised that I did not turn out to be a criminal mastermind or serial killer. But they were surprised.
I had no desire to do one of those southern novel type of things where I tell off all of the classmates who treated me like I was “less than” because they had money and I didn’t, or insist on being called “Dr” by the teachers who told me after I dropped out of high school that I was going to end up in prison. It was funny to drive through the town where my mother lived and realize that I could probably buy two or three of those big, fancy houses I used to walk by and envy on my way home from school. (Don’t be too impressed. Housing prices in small towns are pretty dirt cheap and three of those big houses together wouldn’t cost as much as a nice condo in Santa Monica.) The truth is, I didn’t care. I went to mass with my mother and I am sure several of the people in the church must have been some of those kids I went to elementary school with. After all of this time, I couldn’t recognize a single one of them and it wouldn’t have mattered to me if I did.
The first time I went back, shortly after my father died, a wise woman I worked with asked me how it went. I gave some noncommittal response and she said,
“I guess it is okay to go back occasionally, but there is great comfort in coming home to the lives we have made for ourselves, isn’t there?”
I had never thought about it that way, but I think she said something very profound. (I told you she was a wise woman.)
There is great comfort in the lives we have made for ourselves.
I’m sure a psychoanalyst would find something pathological in my disinterest in confronting my past. I think Katharine Hepburn, in “On Golden Pond”, said it well,
Don’t you think that everyone looks back on their childhood with a certain amount of bitterness and regret about something? You’re a big girl now. Aren’t you tired of it all? Bore, bore. It doesn’t have to ruin your life, darling. Life marches by, Chels. I suggest you get on with it.
The past does bore me. You can’t change it. It is what it is. I do have a point, though, otherwise I would not be writing this. I see young people – and some old people – who spend their lives caught up in self-pity or anger over the things they didn’t have, the people who were unkind to them, the very unfairness of events that happened to them. I wonder if they enjoy wallowing in self-pity and telling themselves they are helpless.
Do they not understand that there can be great comfort in the lives we make for ourselves? Because there can.
And I suggest they get on with it.
With the new SAS On-Demand for Academics, I presume there will be a lot of professors who have a teaching assistant, research assistant or intern preparing the data for examples for their classes. Or, you may be co-authoring a paper with one of your colleagues.
Let’s suppose you are working on a SAS Enterprise Guide project with someone else, someone who may or may not be named Chelsea. This person sends you a nicely done up SAS Enterprise Guide project to use for a paper, that may or may not be presented at the Western Users of SAS Software conference in September. However, since this hypothetical person has the file on their computer and you have it on yours, the path will not be quite right.
You COULD upload the file to the class directory, if it is for a course. However, there is an easier way if you just want to access it in 5 seconds or less.
First, right-click on the data set that you want to change.
A drop-down menu will, um, drop down, which has, among other things, a choice for PROPERTIES.
Select PROPERTIES and another window will pop up.
This window shows the file name and path. Click on the button that says CHANGE to, um, change it.
Now all you need to do is what you do any other time in your life you are opening a file in Windows. A window will pop up that lets you go to the directory where your file is stored. Click on the file you want, click on the OPEN button at the bottom right of the window and Voila! You have re-set the data set for your SAS EG project. Now you can re-run it so that you have really cool results, like this:
If you, too, want to be one of the cool kids on the block running logistic regressions with SAS Enterprise Guide, come to my hands-on workshop at WUSS. There will be prizes. (If Chelsea reminds me to bring them.)
Once upon a time there were statisticians who thought the answer to everything was to be as precise, correct and “bleeding edge” as possible. If their analyses were precise to 12 decimal places instead of 5, of course they were better because, as everyone knows, 12 is more than 5 (and statisticians knew it better, being better at math than most people).
Occasionally, people came along who suggested that newer was not always better, that perhaps sentences with the word “bleeding” in them were not always reflective of best practices, as in,
“I stuck my hand in the piranha tank and now I am bleeding.”
Such people had their American Statistical Association membership cards torn up by a pack of wolves and were banished to the dungeon where they were forced to memorize regular expressions in Perl until their heads exploded. Either that, or they were eaten by piranhas.
Perhaps I am exaggerating a tad bit, but it is true that there has been an over-emphasis on whatever is the shiniest, new technique on the block. Before my time, factor analysis was the answer to everything. I remember when Structural Equation Modeling was the answer to everything (yes, I am old). After that, Item Response Theory (IRT) was the answer to everything. Multiple imputation and mixed models both had their brief flings at being the answer to everything. Now it is propensity scores.
A study by Sturmer et al. (2006) is just one example of a few recent analyses that have shown an almost exponential growth in the popularity of propensity score matching, from a handful of studies in the late nineties to everybody and their brother.
I tend to be on the cynical side when it comes to new techniques, and sometimes I’m completely wrong about differences from older techniques being a bunch of useless trivial crap someone over-hyped to get seven articles published so they could get tenure. Not long ago, I was working on a project that used both item response theory and multiple imputation.
Sometimes, but not always.
This brings us to propensity score matching. I kind of disagree with the article by Sturmer and friends. They say,
Only 9 of 69 studies (13%) had an effect estimate that differed by more than 20% from that obtained with a conventional outcome model in all PS analyses presented. … Publication of results based on propensity score methods has increased dramatically, but there is little evidence that these methods yield substantially different estimates compared with conventional multivariable methods.
Actually, 14 of the studies found differences greater than 20% (read their whole study, not just the abstract, to understand the discrepancy, you slacker!) Let us split no hairs, however, and agree for the moment that 13% it is … if more often than one out of every eight times the results differ by at least 20%, I’d say a technique is worth doing. I think the Sturmer gang has called the glass empty when it is actually one-eighth full.
HOWEVER … I do agree that the call for increased complexity can go too far.
The moral of the story, so if you are tired, you can stop reading after this:
Given that there are times that propensity score matching does make a substantial difference in the results, if you go with the simplest, most understandable method of propensity score matching the results will be pretty much the same as if you went with the most difficult, most obscure method.
Recently, I gave a talk on propensity scores and the results in the example I used replicated what I and many others have found time after time. That is, whether you do quintiles or nearest neighbor matching or calipers, you get pretty much the same result. This is relevant because as I will show in the next blog post or two, when I get around to it, doing a quintile match is relatively easy. “Relatively” is the key word in that sentence. If you find logistic regression easy you will find propensity score matching on quintiles easy. Nearest neighbor matching is harder and calipers is harder still.
In quintiles, you divide your sample into five groups: the 20% LEAST likely to end up in your treatment group is quintile 1, and so on up to the 20% with the GREATEST likelihood of ending up in your treatment group, which is quintile 5. You match the subjects by quintiles. So, if 12% of the treatment group is in quintile 1, you randomly select 12% of the control subjects from quintile 1. You can easily do quintile matching in SAS with PROC LOGISTIC, PROC UNIVARIATE and a few DATA steps.
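To make the quintile idea concrete, here is a minimal sketch in Python rather than SAS. It assumes the propensity scores have already been estimated (for example, by a logistic regression) and takes one reading of the 12% example: the selected controls should have the same quintile mix as the treatment group. The function names, the made-up data and the choice to take the largest control sample that allows this mix are all assumptions for illustration, not the SAS steps mentioned above.

```python
import random
import statistics

def quintile_of(score, cuts):
    """Return 1..5: one plus the number of cut points the score exceeds."""
    return 1 + sum(score > c for c in cuts)

def quintile_match(subjects, seed=0):
    """subjects: list of (subject_id, treated, propensity) tuples.
    Returns ids of randomly selected controls whose quintile mix
    mirrors that of the treatment group (hypothetical helper)."""
    rng = random.Random(seed)
    # Four cut points splitting the whole sample into five equal groups.
    cuts = statistics.quantiles([p for _, _, p in subjects], n=5)
    treated = {q: [] for q in range(1, 6)}
    controls = {q: [] for q in range(1, 6)}
    for sid, is_treated, p in subjects:
        group = treated if is_treated else controls
        group[quintile_of(p, cuts)].append(sid)
    n_treated = sum(len(ids) for ids in treated.values())
    # Share of the treatment group that falls in each quintile.
    share = {q: len(treated[q]) / n_treated for q in treated}
    # Largest control sample for which every quintile has enough controls.
    n_select = min(len(controls[q]) / share[q] for q in share if share[q] > 0)
    selected = []
    for q in range(1, 6):
        k = min(int(share[q] * n_select), len(controls[q]))
        selected.extend(rng.sample(controls[q], k))
    return selected

# Made-up data: 200 subjects, treatment more likely at higher propensity.
data_rng = random.Random(42)
subjects = []
for i in range(200):
    p = data_rng.random()
    subjects.append((i, data_rng.random() < p, p))

selected = quintile_match(subjects)
print(len(selected), "controls selected")
```

Because the top quintile usually holds few controls, it tends to limit how many controls can be selected overall, which is worth checking before you trust the match.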
In nearest neighbor matching, as the name implies, you match each subject in the treatment group with a subject in the control group who is nearest in probability of ending up in the treatment group. This would be really difficult to do in SAS without some knowledge of macro programming. Stata has a psmatch2 command. I’ve read some interesting discussion on it. There is also an nnmatch command. I haven’t used either myself. Both appear to be .ado files and similar to the SAS user-written macro. Suffice it to say that nearest neighbor matching does not avail itself of basic statistical procedures in the same way that quintiles does.
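For comparison, here is a rough Python sketch of nearest neighbor matching. It is a greedy one-to-one match without replacement, which is only one of several variants, and it is not how psmatch2, nnmatch or any SAS macro is implemented. The function name and the tiny made-up propensity scores are assumptions for illustration.

```python
def nearest_neighbor_match(treated, controls):
    """treated, controls: dicts mapping subject id -> propensity score.
    Greedy 1:1 matching without replacement (hypothetical helper).
    Returns {treated_id: control_id}."""
    available = dict(controls)  # copy, so the caller's dict is untouched
    pairs = {}
    for t_id, t_score in treated.items():
        if not available:
            break  # ran out of controls to hand out
        # Pick the unused control whose score is closest to this subject's.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        pairs[t_id] = c_id
        del available[c_id]
    return pairs

# Made-up propensity scores.
nn_treated = {"t1": 0.30, "t2": 0.62, "t3": 0.81}
nn_controls = {"c1": 0.10, "c2": 0.33, "c3": 0.58, "c4": 0.79, "c5": 0.95}

nn_pairs = nearest_neighbor_match(nn_treated, nn_controls)
print(nn_pairs)  # {'t1': 'c2', 't2': 'c3', 't3': 'c4'}
```

With greedy matching the order of the treated subjects can change who gets which control, which is one reason real implementations offer more options than this sketch does.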
Then, there is calipers (radius) matching, which uses the nearest neighbors within a given radius. Attempting this in SAS without the use of macro programming will just drive you insane, and your neighbors with you. Earlier this year, I rambled on a great deal about how you could do it using a macro.
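The caliper idea is a small change to nearest neighbor matching: a treated subject is only matched if the closest unused control is within a set distance. The Python sketch below illustrates that one idea and is not the macro referred to above. The function name, the caliper of 0.05 and the made-up scores are assumptions.

```python
def caliper_match(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest neighbor matching, restricted to controls whose
    propensity score is within `caliper` of the treated subject's score.
    Treated subjects with no control that close stay unmatched."""
    available = dict(controls)  # copy, so the caller's dict is untouched
    pairs = {}
    for t_id, t_score in treated.items():
        # Distance to every unused control that falls inside the caliper.
        in_range = {
            c_id: abs(c_score - t_score)
            for c_id, c_score in available.items()
            if abs(c_score - t_score) <= caliper
        }
        if not in_range:
            continue  # nobody near enough, so leave this subject out
        c_id = min(in_range, key=in_range.get)
        pairs[t_id] = c_id
        del available[c_id]
    return pairs

# Made-up propensity scores; t3 has no control within 0.05.
cal_treated = {"t1": 0.30, "t2": 0.62, "t3": 0.90}
cal_controls = {"c1": 0.10, "c2": 0.33, "c3": 0.58, "c4": 0.79}

caliper_pairs = caliper_match(cal_treated, cal_controls, caliper=0.05)
print(caliper_pairs)  # {'t1': 'c2', 't2': 'c3'}
```

The price of the caliper is that some treated subjects, like t3 here, drop out of the analysis altogether.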
Inter-American Development Bank vs. AnalysisFactor
Recently, I read an article by Heinrich, Maffioli and Vázquez of the Inter-American Development Bank, who said,
…. alternative approaches have recently been recognized as superior to pairwise matching. In contrast to matching one comparison group case with a given treated case, it has been found that estimates are more stable (and make better use of available data) if they consider all comparison cases that are sufficiently close to a given treated case.
To which I reply,
Superior is in the eye of the beholder.
I also recently read a post by The Analysis Factor, a.k.a. Karen Grace-Martin, on when to fight for your analysis and when to give in. I really loved this point she made:
Keep in mind the two goals in data analysis:
- The analysis needs to accurately reflect the limits of the design and the data, while still answering the research question.
- The analysis needs to communicate the results to the audience.
Given that quintiles are far easier to communicate, and considering both of these goals, I would say MOST of the time the quintile method is superior and almost all of the rest of the time the nearest neighbor method is superior. The only time you’d really benefit from methods such as radius matching is when the nearest neighbor is often really not very near at all. And in that case, I would question the wisdom of doing propensity score matching at all.