Aug
30
I probably hadn’t thought about canonical correlation in twenty years, but then a problem came up this week where it was the exact technique I needed. What made me laugh, though, is the particular problem I was dealing with twenty years ago had school achievement measures – tests of English, Mathematics, Science and Social Studies – as the dependent variables and the problem I was dealing with this week had, you guessed it, school achievement measures as the dependent variables.
So, I thought I’d ramble on about canonical correlation for a while…
Canonical correlation is used when you want to maximize the correlation between a set of X variables and a set of Y variables. For example, you might want to know how much teachers can affect student performance. You have a set of teacher factors: years of experience, percentage of time spent in hands-on activities, percentage of time spent on classroom discipline, minutes per week spent on preparation, minutes per week spent on grading. You have a set of student outcome variables: math achievement, reading achievement and science achievement.
You could do three multiple regression equations and maximize the explained variance in each dependent variable individually. However, hypothetically speaking, what if you found that increasing classroom structure increased achievement in science but decreased it in math? If you’re an elementary teacher, it’s probably hard to relax and increase the classroom rules during the day depending on the subject. School achievement is one of the few good candidates that leaps to mind for canonical correlation because you have multiple dependent variables and it is hard to argue one is more important than the other. We want kids who can read AND do math.
In a simple linear regression, we are calculating the covariance between two variables, X and Y. (The standardized form of the covariance is correlation.)
In a regular multiple regression equation we are trying to select the set of regression coefficients that maximize the covariance between a set of X variables and a single variable. To get a multiple correlation, we apply those regression coefficients to the X variables for each individual. We get a predicted score for that individual, the Y-hat. The correlation between the predicted Y and the actual Y is the multiple correlation.
In a canonical correlation, we go one step further. We have a set of X variables and a set of Y variables, and we are trying to maximize the covariance between TWO matrices. (If you remember your normal equations from college, with multiple regression you had an X matrix and a Y vector – a vector, in statistical terms, being just a column of numbers, not to be confused with the geometric term of the same name, just as one should not confuse a three-way interaction in ANOVA with the event of the same name in pornography. I hear a great deal more of the latter than the former can be found on the Internet. Extremely odd when you consider that the initial motivation for development of the Internet was assistance of scientific research and not distribution of pornography. It’s true. You can look it up.)
ANYWAY … multiple regression maximizes the covariance between the X matrix and the Y vector while canonical correlation maximizes the covariance between the X matrix and the Y matrix.
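If you would rather see that in numbers, here is a rough sketch in Python with NumPy. It is not SAS or SPSS and it is not taken from either analysis mentioned above. The data are simulated stand-ins for five teacher factors and three achievement scores, and the function name canonical_correlations is made up for illustration. It assumes no missing data and no perfectly redundant columns. The canonical correlations are the correlations between the best-matching weighted combinations of the X columns and the Y columns, and the last few lines check that with a single Y column the one canonical correlation is the same number as the multiple correlation.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Canonical correlations between the columns of X and the columns of Y.

    Sketch only: assumes complete data and full column rank in X and Y.
    """
    # Center each column so cross-products behave like covariances
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered matrices
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # The singular values of Qx'Qy are the canonical correlations,
    # returned largest first
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)


# Simulated (made-up) data: 200 classrooms
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))            # stand-in for five teacher factors
B = rng.normal(size=(5, 3))            # arbitrary true weights
Y = X @ B + rng.normal(size=(n, 3))    # stand-in for three achievement scores

rho = canonical_correlations(X, Y)
print("canonical correlations:", np.round(rho, 3))

# Special case: one Y column. Fit an ordinary multiple regression,
# correlate predicted Y with actual Y, and compare.
y = Y[:, 0]
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ coef
multiple_r = np.corrcoef(y, y_hat)[0, 1]
single_rho = canonical_correlations(X, y.reshape(-1, 1))[0]
print("multiple R:", round(multiple_r, 6), "canonical r:", round(single_rho, 6))
```

With three Y variables you get three canonical correlations, one per pair of weighted combinations, and the first is the largest.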
I was going to say more about this but I have to finish my second WUSS paper on procedures. Speaking of SAS, if you wanted to do a canonical correlation with SAS, it’s very easy. You simply type:
proc cancorr data = datasetname ;
var first-list-of-variable-names ;
with second-list-of-variable-names ;
run ;
Of course, there are a ton of options. One point really worth making is that you can analyze the covariance matrix, correlation matrix and other types of matrices. This is useful because listwise deletion is a common problem in analyses with a large number of variables; that is, if a person is missing just one out of ten variables, he or she is dropped from the analysis. So, if you have ten variables, each of which is missing only 4% of the data, you can easily end up with 20-40% of your subjects dropped from the analysis. (It would be a little odd if it was 40%, but that’s another topic.)
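The arithmetic behind that 20-40% is easy to sketch. The Python below is an illustration only, and the function names are made up. The first function assumes missingness is independent across variables, which real data rarely guarantee. The second gives the two extremes: everyone who is missing anything is missing everything, or nobody is missing more than one variable.

```python
def listwise_loss_if_independent(n_vars, p_missing):
    """Share of cases dropped if each variable is missing independently."""
    # A case survives only if it is present on every one of the variables
    return 1 - (1 - p_missing) ** n_vars


def listwise_loss_bounds(n_vars, p_missing):
    """Smallest and largest possible share of cases dropped."""
    lowest = p_missing                       # the same people are missing every variable
    highest = min(1.0, n_vars * p_missing)   # no one is missing more than one variable
    return lowest, highest


independent = listwise_loss_if_independent(10, 0.04)
lowest, highest = listwise_loss_bounds(10, 0.04)
print(round(independent, 3), round(lowest, 2), round(highest, 2))
```

Under independence, ten variables at 4% missing each cost you about a third of the sample. Reaching the full 40% would take a data set where no one is missing more than one variable.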
Speaking of SPSS, even though we weren’t: although there are usually pointy-clicky things for just about every statistical procedure in SPSS, I could not find one for canonical correlation. No big deal, just open a syntax window and use a MANOVA statement, like so (this uses the example from the anorexic data set included in the SPSS samples).
MANOVA weight mens fast with binge vomit purge
/discrim all alpha(1)
/Print = sig(eigen dim) .
I would like to say a lot more about this but I promised to have a paper on procedures novice programmers need to learn and I am kind of guessing that the conference organizers would give me “THAT LOOK” if I suggested that CANCORR was one of those procedures. I know the exact look. It is the one Maria gave Dennis when he asked her if she had thought of re-setting the programmable random access memory on her computer when she had a problem. She said,
“No, but I thought of drop-kicking it.”

Aug
23
Generalized Linear Models & Why Statisticians Should Not Be Allowed to Name Things
Filed Under statistics
Statisticians are good at lots of things but naming is not one of them. If Carl Linnaeus had been a statistician the name for camel would be Horse With Hump and for elephant Really Big Horse with Nose based on the fact that both have four legs and people ride them.
Such is the case with the Generalized Linear Model.
Let’s start with the General Linear Model, because it is easier. As I said before, the General Linear Model is general (and also linear and a model). Then I said this: “Almost all questions that can be stated:
Is there a relationship between this thing and this other thing?”
… and rambled on a bunch.
So, now that we have that down pat … there is the GENERALIZED Linear Model because the general linear model was not general enough for us. You see, the General Linear Model (well, both are GLM, so let’s call it the horse) was based on the assumption that the errors follow a multivariate normal distribution. So a Generalized Linear Model (let’s call it the camel) generalizes from our basic GLM (the horse) to other types of distributions.
Then there are General Linear MIXED models (maybe those are zebras ?).
The basic Analysis of Variance, going back to the original apparently not-so-general-as-originally-believed model, is quite simple. You have one (or more) independent variables that can be broken down into two (or more) groups. Let’s say gender. You have people in two groups, male and female. This is not a sample of all genders. It’s all the genders there are. The same is true if you put people in an experimental and a control group. Effects like these, where your groups are all the levels there are, are fixed effects. This is NOT a mixed model, because you only have one type of effect. Hurray you.
You can have a random effects only model. This example from Stanford looks at whether there is variability across brands of beer. With brands as the random effect, eight different measures are taken across six brands of beer. This is a very worthwhile study as it involves drinking 48 bottles of beer. Oh, and repeated measures, too.
What if you had taken 20 people and put ten in one group that got to drink beer and ten in another group and tested them four times, with the first group getting to drink two beers between each testing and the second group having to watch Sarah Palin videos? Now you have a random effect. You do not have all possible levels of people. You have randomly sampled twenty out of the population. There will be an effect of person and an effect of group. So, this is a random effect AND a fixed effect in your model. I presume your dependent variable will be stupidity with the research question what makes you become stupider, listening to Palin or getting drunk. I’m agnostic on this question.
If you have a MIX of fixed and random effects, then it is a mixed model.
Generalized Linear Mixed Models are perfectly cool with heterocatanomic multivariate distributions, that is, when you have some response variables that have one distribution, say a Poisson distribution, and another that has a normal distribution.
Then, just when you thought it was safe to go read something else, there are Nonlinear mixed models.
My husband asked if there was such a thing as model mixers, where statisticians got to go to parties and mix with models like Tyra Banks.
Lovely daughter number two mutters,
“Have you seen what the people Mom hangs out with LOOK LIKE? The best you could say for any of them is they are good-looking for old people who work on computers all day. So, for model mixers, I’m going with – No.”
Jun
2
SAS ENTERPRISE MINER NOT WORKING? HERE’S WHY (maybe)
Filed Under Software, Technology, statistics
If I had time, which I don’t, I would start a series of how-to articles for statistical software and copy the Car Talk scale they use as a guide for whether or not you should attempt a job yourself, from
a. There are two kinds of screwdrivers ?
to
e. I have built a working nuclear reactor out of wood
I was very excited when I heard that SAS On-Demand was going to offer a cloud version of Enterprise Miner for use in teaching, for free, even. “Was” is the key word in that sentence. Should you do this yourself? Well, it depends. This is very far past an “e” on the Do-it-yourself scale. Do you remember the part in Iron Man where the guy built the Iron Man super hero suit out of spare parts salvaged from a plane crash while trapped in a cave? Well, if you’re that guy, you can do it yourself.
Sigh. I can discern from the fact that you are still reading this that you are not going to listen to me and you are going to try anyway. Yeah, I didn’t listen to me either. There are approximately 3,476 steps in getting Enterprise Miner to work. Let’s assume you have a SAS profile, you logged in, you have a user name and password for SAS on-demand and you have set up a course or someone has registered you for a course. If you are lost already, go here:
http://support.sas.com/ondemand/index.html
This has all of the information you need to set up your account, and it is pretty straightforward. If you try setting up your account and Enterprise Miner does not work, as in, fails to start, your problem may be that you have the wrong version of Java enabled. You may have been fooled by the system requirements for Enterprise Miner, which said: {Warning: incorrect information between lines}
==========================================================================
“System Requirements for SAS® Enterprise Miner™
Operating System(s): Any system that supports the Sun Microsystems Java Runtime Environment (JRE). Typically, this includes Unix, Linux, and various Microsoft Windows operating systems, such as Microsoft Windows XP and Microsoft Windows Vista.
Macintosh operating systems are not officially supported. For information about a possible workaround that you could test, see SAS Usage Note 18131.
Java Runtime Environment: Java Runtime Environment (JRE) version 1.6.0_15 or greater.”
=========================================================================
NOT EXACTLY !!!
Do not be fooled into thinking that everything you need to know about system requirements is on the page you get when you follow the system requirements link.
After you log in to your SAS on-demand account and click on a link to install your software you will see a link about configuring your system for Enterprise Miner. CLICK ON THIS LINK AND READ EVERYTHING OR YOU WILL BE SORRY.
http://support.sas.com/ondemand/emconfig.html
- You may be tempted to skip over the part about the Java Run Time environment because you just read the part above under systems requirements and you met those. Do not do that.
- You may be tempted to go to the Sun site and download the latest JRE version. Do not do that either.
Do ALL of the stuff on this page linked above.
Go to cmd and type javaws -viewer.
If you don’t have JRE 1.6.0_18 enabled (and who does?) go to the link on that page and download it. It is NOT the latest version.
Follow the directions on the page that I told you to read every word of and uncheck all of the other versions you may have installed so that only 1.6.0_18 is enabled.
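The catch here is that “1.6.0_15 or greater” and “exactly 1.6.0_18” are different tests. As a rough sketch, not anything SAS supplies, here is that distinction in Python. The function names are made up, and the sample string is an assumption about what a Java version line looks like on your machine, so paste in whatever yours actually reports.

```python
import re

REQUIRED = (1, 6, 0, 18)  # JRE 1.6.0_18, the version the configuration page asks for


def parse_java_version(text):
    """Pull a version such as 1.6.0_18 out of a line of text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)(?:_(\d+))?", text)
    if match is None:
        return None
    # A missing update number (no "_18" part) is treated as update 0
    return tuple(int(part) if part else 0 for part in match.groups())


def check_jre(text, required=REQUIRED):
    """Say whether the reported version is exactly the required one."""
    found = parse_java_version(text)
    if found is None:
        return "no version found"
    if found == required:
        return "exact match"
    if found > required:
        return "newer than required"
    return "older than required"


# Hypothetical version line; replace with the one from your own machine
sample = 'java version "1.6.0_20"'
print(check_jre(sample))
```

A newer JRE passes the “or greater” test on the system requirements page and still is not the version the configuration page tells you to enable.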
Now … try starting SAS On Demand for Academics: Enterprise Miner by going back to that page and clicking on the second link. It should start.
Patience is a virtue.
Enterprise Miner can be really slow. At first, I thought it wasn’t working. I switched to a better connection and a faster computer (it wasn’t hard, I had to roll my desk chair a few feet but, being the finely-tuned athletic machine that I am, I managed). My advice is if you have several computers, use the best one for this. For a lot of things, the speed of connection and how much RAM you have may not make a difference. This is not one of those things.
Getting Your Data into SAS Enterprise Miner
But…. you have no data, do you? Your problem may be that you are not an instructor. Only instructors can upload data to the course. If that’s your problem, there’s not much I can do for you. If you are an instructor, go to the instructor home page > course information. Scroll down and you will find, in about the middle of the page, instructions on how to upload your data. You can use any FTP program. In fact, even though Enterprise Miner does not run on the Macintosh my data happened to be on a Mac and I uploaded it using Fetch. It worked fine.
If your data DON’T upload fine, check the settings on your FTP program. A lot of organizations have set the default to be SFTP. SAS didn’t seem to like this. I changed it to FTP and my data uploaded happily away.
If you upload a SAS data set, then you and your students will be able to access the data using the LIBNAME statement shown below. You’ll want to include the access=readonly parameter to prevent your students from modifying the data.
libname mydata "/courses/BLAH/BLAHBLAH/THISCOURSE/saslib" access=readonly;
The BLAHs will be replaced with your course specific information. If you are teaching more than one course, when you upload your data and when you use the libname statement, be sure you include the name for THISCOURSE. Otherwise, you won’t be the first professor to have uploaded the data for the wrong course and have a class of very confused students. You won’t be the last, either.
Okay, you have uploaded your data to your directory and opened Enterprise Miner. Now what?
Create a new project. Go to FILE > NEW > PROJECT. Give it a name. I named mine Joe. On second thought, I should have named it Bob, because when you spell it backwards, it’s still Bob.
Open the program editor window. I thought when I went to the FILE menu and picked NEW I would have the option for program, code or something. No.
See that little thing that looks like the program editor window that you wouldn’t have noticed if you weren’t specifically looking for it? That’s it.

Run the Libname statement above, replacing the BLAHs and THISCOURSE to match your actual directory.
Okay, it is running, you have data uploaded, a project open and your library available within the project. The next thing I would do is click on the Help menu (honest) and start reading whatever interests you, like getting started. Unlike most documentation which is written like someone pasted a web page into Babelfish, this is actually easy to follow, well-written and less boring than watching paint dry.
I now have Enterprise Miner working on THREE computers, two using the on-demand version and one with Enterprise Miner for Desktop. Someone should bring me a prize. But no one did.
May
30
In my copious spare time, of which I have none, I occasionally get the urge to actually read technical books from beginning to end.
I think my life took the path of most grown-ups in my field. You get a degree, or two or three or four. Perhaps during the course of that, but certainly at the end, you get what my mother refers to as “a real job”, which is a job outside of a university. In the course of this real job, they require you to do stuff – produce reports, answer questions, write research designs – whatever your real job happens to be. In producing these reports, answering questions and so on, you read PARTS OF the manual. The operative phrase in this sentence being “parts of”. You read the part that tells you how to obtain a Wald statistic using Stata – but you skip the part on what a Wald statistic actually is because you have a meeting at 2:30. You read the article on odds ratios in logistic regression but you skip the part on parallel processing for maximum likelihood methods because you have a report due tomorrow.
So, maybe you have been just skipping over very useful features in software and not having the time to notice. I am sure I must have mentioned this book before,
Programming and Data Management for IBM SPSS Statistics 18: A Guide for IBM SPSS Statistics and SAS© Users. It is very well-written and very free. One of the smartest things SPSS has done is make a ton of its documentation available for free, based, I think, on the reasonable notion that the better people can use its software the more likely they are to buy it. Also, as far as the title, it should be noted that 90% of the book is how to use SPSS and the other 10% is how to use SPSS if you know SAS pretty well. I’ve actually found that section extremely useful.
Anyway, as for aggregate, which you might think I was going to discuss because it is in the title: Aggregate is an incredibly cool feature in SPSS that you may not have ever noticed. My friend works in an Emergency Room in a large city. She is quite concerned that some people are using the ER for primary care or even just for attention. One evening she said to a patient:
“You have a serious problem because I KNOW YOUR NAME! Do you know what the definition of the word ‘emergency’ is? No one should be in the emergency room so often that they and the staff are on a first name basis !”
Let’s say you work in this ER. You have a database with client records and most clients come once, some of them come more than once. You’d like to attach a variable to each client that is “Number of Visits”. You could then do all kinds of analyses, say, pulling out all the patients with 10 or more visits this year and seeing how many visits that represents. Or, you might want to know how much total time these chronic emergencies take up.
Here is what you do:
Go to the DATA menu and select AGGREGATE
For BREAK VARIABLE select “ClientID” or whatever your variable is named.
Check the button next to NUMBER OF CASES. The default name is N_BREAK but I changed it to “Visits” because that was a lot more obvious.
Check the button next to “ADD AGGREGATED VARIABLES TO ACTIVE DATASET” .
Click OK.
Now I want to know how many total visits were from “chronic emergencies” and how many total minutes they took up in my ER. First, I select out these folks by
From the DATA menu choose
SELECT CASES
Click the button next to IF CONDITION IS SATISFIED
In the pop-up window, enter Visits > 9
Click Continue
Click OK
Go to ANALYZE
Then DESCRIPTIVE STATISTICS
Then select DESCRIPTIVES
Move Length of Visit under Variables
Click on the OPTIONS button
Click the button next to SUM
Click CONTINUE
Click OK
If you prefer syntax to pointing and clicking, here you go:
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/BREAK=Client
/Visits=N.
COMPUTE filter_$=(Visits > 9).
VARIABLE LABEL filter_$ 'Visits > 9 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.
DESCRIPTIVES VARIABLES=Length_Of_Contact
/STATISTICS=MEAN SUM STDDEV MIN MAX.
There are plenty of other ways you could aggregate your data and add the counter to each record but this way is just so simple it is worth remembering.
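For instance, here is one of those other ways, sketched in plain Python for anyone without SPSS at hand. The records are invented for illustration, and the field names simply mirror the syntax above. It does the same three things: count the rows per client, attach the count to every row, then total the visits and minutes for clients with more than nine visits.

```python
from collections import Counter

# Invented visit records, one dictionary per ER visit
records = (
    [{"Client": "A", "Length_Of_Contact": 30} for _ in range(12)]
    + [{"Client": "B", "Length_Of_Contact": 45}]
    + [{"Client": "C", "Length_Of_Contact": 20} for _ in range(10)]
    + [{"Client": "D", "Length_Of_Contact": 15} for _ in range(9)]
)

# Like AGGREGATE with /BREAK=Client and /Visits=N: count rows per client
visit_counts = Counter(row["Client"] for row in records)

# Like MODE=ADDVARIABLES: attach the count to every record
for row in records:
    row["Visits"] = visit_counts[row["Client"]]

# Like the filter on Visits > 9
chronic = [row for row in records if row["Visits"] > 9]

# Like DESCRIPTIVES with SUM: total visits and total minutes for that group
total_visits = len(chronic)
total_minutes = sum(row["Length_Of_Contact"] for row in chronic)
print(total_visits, total_minutes)
```

Client D, with exactly nine visits, is left out, just as with the Visits > 9 condition in SELECT CASES.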
TRUE CONFESSION: I hadn’t used aggregate in well over a year. Someone asked me how to do this and I was thinking of the LAG function and Proc summary using SAS and there was this dim memory that there was some other way to do it. So, I just started reading the data management book from page one. A lot of it I skipped over, of course. The second chapter, on programming tips and best practices is either new or I had skipped it when I read the book originally. It was good enough to warrant mentioning randomly, which I just did. Anyway, some time after the second chapter I came across the mention of aggregate and it all came back to me.
I may have told this story before and forgotten it, so I am telling it again which is kind of the point. A few years ago, I was writing a proposal on increasing parental involvement in special education. I said to my incredibly helpful research assistant that someone must have published articles on this, I mean, it isn’t exactly an esoteric topic, so please run through a few databases of scientific articles and bring me the references. She came in less than an hour later, laughing, with a list of articles. She said,
“Yes, you’re right. Someone had done research on this. Four of the first twelve references to pop up were by someone named Rousey.”
This is funny because that was my name before I remarried and not only had someone done research on it but that someone was me! Come on, a couple of them were from 1990. Can you remember what YOU were doing in 1990? If you’re the age of a lot of people I meet at conferences and my helpful young research assistants you were probably drinking juice from a box in Ms. Campbell’s kindergarten class.
So, now you know one of the main reasons I write this blog. I’ll vaguely recall something about crontab or aggregate or a geometry column in Proc GMAP and remember that I used that a year or two ago. If I write a post about it, maybe it will be helpful to someone else, and, in 2012, when I need to do the same thing again, I can search my blog and well, what do you know!
May
13
Margin of Error
Filed Under Dr. De Mars General Life Ramblings, Technology, statistics
On my way back from Tunisia via Paris I ended up in a redneck dive bar somewhere in Georgia reading the New York Times on my Kindle while the lady next to me asked the very drunk waitress if she knew who had won at NASCAR this weekend.
This sounds like the beginning of a joke, but it isn’t.
Yes, it is your fault, Delta Airlines and U.S. Border Patrol, if you’re listening, which I am sure you are not.
The first concern came when I looked at my ticket and saw that I had an hour and twenty minutes between arriving from Paris and leaving for Los Angeles. This did not go well. The passport control computer had some problem which resulted in hundreds of people being stuck waiting for an hour or more to get their passports checked. By the time we got through, we had all missed our flights and the same hundreds of people were sent to Delta Airlines which very unsympathetically said it was not their fault that people missed their flights because it was the federal computers and we all needed to pay for our own hotel rooms and fly out in the morning.
Why is it that we are eager to invest in mortgages, securities and more, with companies assuring us that they can predict the future well, but then other companies, with a lot fewer unknowns, swear that they cannot predict problems and that “It is not our fault”?
Being the good statistician I am, I started asking people in line who had missed their flights and were getting re-booked, on the shuttle to the hotel and at the hotel how long was the layover between flights. Every one of those people had a layover between one hour and ninety minutes. No one who had a two hour or longer layover was in the very large group of people who missed their flights.
You know how the airport tells you to show up two hours early for international flights? That’s a good idea. You should do that. I have never missed a flight when I came two hours early, although I have sometimes made the plane by just ten minutes or less.
If Delta had a rule in their computer system that did not allow passengers to book flights less than two hours apart when making an international connection, problems like those that happened today would occur far more rarely. They could even have a manual override on that so if you chose to cut it closer, on your head be it. The extraordinarily UNhelpful customer service person at Delta said to me,
“But it was not our fault. Why should Delta pay for your hotel room? If everything had gone right, if there hadn’t been any computer error, then all of these people would have made their flights.”
That is probably true, however, a system that allows for no margin of error, that assumes “everything will go right” is a bad system and it IS their fault. Of course, if Delta implemented my system, they would sell fewer flights. The problem currently is that people like me assumed Delta personnel knew what they were doing, especially given that Atlanta is their hub, and that if, even though the norm is two hours for international flights, they allowed an hour, they must have some knowledge we didn’t.
In fact, it appears that they were willing to accept the risk that hundreds of passengers would miss flights because hey, it didn’t cost them anything but a little aggravation.
I started out my career decades ago as an industrial engineer. Every industrial engineer knows there are two types of hours, standard hours and actual hours. A standard hour is how long it takes to make a widget. You do a time study and figure that it takes five minutes to weld it, an hour for it to be in the cooling area, and another 15 minutes to sand the edges. So, the part takes an hour and twenty minutes in standard hours. Only a complete moron would base their factory schedule or any other plans on that. You see, sometimes, the machine breaks. You run out of parts. The guy who is supposed to be doing the welding is in the bathroom for 10 minutes or out sick for the day. Often the actual hours it takes to get a part done, allowing for machine downtime, operator sick days, parts shortages, and other problems is about double the standard hours.
What about Passport Control? They were the second group that said,
“Hey, it’s not our fault, it’s the computer.”
That has got to be the biggest bullshit excuse on earth when used by anybody. I don’t mean it was the fault of the poor guy at the customs booth, but I definitely think if your computer doesn’t work it IS your organization’s fault. You chose to save money by not having a back-up system, by not having enough IT people on staff, by not paying your programmers well enough to have a system in place that anticipates this.
If the problem is that there was a failure in accessing the passport database, you can’t tell me there isn’t a back up of that database made nightly. No one thought of having a system where you can switch to the back up?
I can’t say what the specific problem with the computer in U.S. Customs was, but I am skeptical that the problem was unforeseeable and unsolvable, just based on my own decades of experience with computer systems. I find it more likely that there was a decision made somewhere that weighed the costs of possible failure against the costs of back-ups and alternative systems. Since the costs of, e.g., lost revenue from flights or paying extra programmers would be borne by the company or agency, and the costs of staying overnight, missing flights, etc. are borne by consumers, there is an incentive to short-change on customer service.
Computers aren’t delivered by God and programmed by archangels. Organizations make choices in the programs they use, back-ups they purchase and people they hire. If you cut corners to the extent there is no allowance for error then, yes, it IS your fault.
May
10
Explaining Unexpected Sights
Filed Under statistics
Ronda and the camels
Normally, walking along the beach in the morning with my daughter, I do not expect a random person to come up to us with the question,
“Would you like to ride my camel?”
However, I was not taken nearly as much by surprise as my daughter because I had been to this same beach twice already and although it was deserted both times and, I have to admit, generally much cleaner than the beach at home, there were a couple of piles of, well, feces. While distinguishing types of animal feces is not a skill frequently called for in statisticians in Los Angeles, I was pretty certain these weren’t from any small domestic animals or drunken tourists.
I knew there had to be a producer of large feces around here somewhere
So, when the gentleman walked up leading a family of camels, that explained a lot.
After a few hours of camel-riding and sunbathing, I was bored, so I came back up to my hotel room to work on my second paper for WUSS. I already submitted a paper on statistics with Enterprise Guide but I wanted to write something on data visualization, just because, and I figured having a deadline would force me to make some progress.
Now, I knew this was around here somewhere …
Creating a bar graph in Enterprise Guide with bar height = means of a second variable
I usually use TASKS > GRAPH > BAR CHART to create a bar chart and I had yet to spot how to create a bar chart which shows the average of one variable for each value of a second variable. In this case, I wanted to see what is the average income for respondents based on the percentage of African-Americans in their neighborhood.
My original reason for using this was to create a bad example and show that you should NOT have 100 categories. As you will see, it did not work out as expected. In fact, it so did not work out as expected that I tried again with percent African-American residents rounded to the nearest 10% because I wanted to look at these data again.
I was sure there had to be a way to create a bar chart by means, and when I had plenty of time to look for it, I found two. In the BAR CHART task, select your column to chart, then under “sum of” select the variable for which you want the means. Next, click the ADVANCED option for the bar chart task. You’ll see an option for “Statistic used to calculate bar”. From the drop-down menu, select average.
[You can also use the bar chart wizard. In step 2, select a variable from the drop-down menu next to bar height. Then click on the sum symbol (the thing that looks like a deformed E) and a window will pop up that lets you select average as the statistic.]
So, I get the chart below and I know it is not supposed to be like that.

Average Income by Neighborhood Percentage African-American
As can be seen from this graph, there is a curvilinear relationship between the percentage of African-American residents in a neighborhood and income (measured on a scale from 1 = under 30K a year to 8 = over 250K a year).
While this may be true, I don’t think it is. My first thought is that there are probably a small number of respondents who came from neighborhoods that are 70-100% African-American because this was a random sample of around 1,100 people and there aren’t that many completely segregated neighborhoods in the country.
I take a look at a pie chart

Pie Chart of % African-American in Neighborhood
and it confirms my suspicions – those bars to the right which are forming that curvilinear pattern are based on a very small sample. All of those bars from 40% on up, COMBINED comprise less than 7.5% of the total sample.
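A bar chart of means hides how many people stand behind each bar, which is why the pie chart was needed. As a rough sketch in plain Python, with a handful of invented rows rather than the survey data, here is one way to get the mean and the count side by side. The function names are made up, and income is on the 1 to 8 scale described above.

```python
from collections import defaultdict


def round_to_nearest_10(pct):
    """Round a whole-number percentage to the nearest 10 (halves round up)."""
    return (pct + 5) // 10 * 10


def mean_and_n_by_bin(rows):
    """Mean income and number of respondents for each rounded percentage."""
    groups = defaultdict(list)
    for pct, income in rows:
        groups[round_to_nearest_10(pct)].append(income)
    return {
        pct_bin: (sum(values) / len(values), len(values))
        for pct_bin, values in sorted(groups.items())
    }


# Invented rows: (percent African-American in neighborhood, income category 1-8)
rows = [(2, 4), (3, 5), (8, 3), (12, 4), (14, 2), (48, 3), (95, 7)]

summary = mean_and_n_by_bin(rows)
for pct_bin, (mean_income, n) in summary.items():
    print(pct_bin, round(mean_income, 2), n)
```

In these made-up rows the tallest bar would be the 100% bin, and it rests on a single respondent, which is the same kind of problem the pie chart exposed.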
I have major commitments today – going to the beach, eating breakfast and watching my daughter at training camp, which is the reason we are here in Tunisia.
I am going to look at this more later. I actually did a lot more last night and that is the part that troubles me a bit.
I really looked into this because the results were unexpected. I KNOW I should always examine every aspect of the data carefully, but the truth is, I know that I do more testing, more exploration when the results are not what I expected to find. I wonder to what extent we all do this and how much that contributes to us confirming what we already expected to find, because when we do, we don’t keep looking for other explanations.
May
3
The Emperor’s New Statistics
Filed Under statistics | 1 Comment
I had the pleasure of attending a lecture Rand Wilcox gave on the state of research. He was far more amusing than I expected from a statistician (perhaps this reflects low self-esteem on my part). He made the very valid point that all statisticians learn in the infancy of their careers that the general linear model makes certain assumptions, like normal distribution, measurement without error (give me a break!) and homoscedasticity. In fact, there is a very well-written summary in an electronic journal, in an article entitled Four assumptions of Multiple Regression that Researchers Should Always Test. (The authors being far less given to snarky comments than I am, there was no parenthetical addition, “But you never do, do you?”) That was one of Wilcox’s points, that SO often analyses are conducted by people who never test the simplest assumptions.
My favorite comment, though, was
“Anyone who thinks they know all of statistics is certifiably insane.”
This is becoming more and more true. I remember chugging along happily in graduate school doing two- and three-way ANOVAs. Then, all of a sudden, if you were going to do an ANOVA with say, ten schools and compare the impact of whole language versus phonics, you had to do a mixed model and specify curriculum type as a fixed effect and school as a random effect. If you did a regular two-way Analysis of Variance it was WRONG (beatings with bamboo sticks for you.) If you switched from a two-way fixed effects model in this case to a mixed model it was more correct. However, did your results turn out dramatically different? Well, actually, no. Slightly different.
Over the years, I have seen the number of statistical software procedures grow dramatically, from those written by Stata users to SPSS add-ons to whole categories of SAS procedures, e.g. Bayesian. What I have NOT seen is a practical increase in the usefulness of our predictions.
From terrorist attacks to volcano eruptions to financial market crises to mortgage prices to unemployment rates, our predictions are so-so in the short term (they often amount to no more than “pretty much like now”, since all of the predictors are pretty much like now), not very helpful at all in the long run, and most effective when viewed in reverse. For example, Mashable tells me that if, instead of paying $3,000 for a G4 Powerbook back in 2002, I had invested that money in Apple stock, I would now have $94,000. Am I the only one who is thinking,
“This prediction would have been a lot more useful in 2002?”
Another thing I have NOT noticed is more understanding of statistics by the general public (unless the words “more” and “understanding” mean the exact opposite of what I think they mean). This commentary by Bill Maher uses the tea partiers as an example, but would apply to just about any group in America (and a whole lot of the rest of the world, too). Maher points out the complete impossibility of cutting taxes, maintaining services and reducing the deficit all at the same time. He notes that Americans want to cut spending rather than increase taxes but when asked what they want to cut spending on, their usual answer is “Nothing”.
Let’s talk about cutting federal spending: 14% of the budget goes to Medicare, 20% to Social Security and 7% to veterans and federal retirees – to be blunt, 41% of the budget is going to old people, of whom we are getting more as the population as a whole ages. Another 6% is going to interest payments on our national debt, which we can’t exactly decide not to pay, and another 20% is defense spending. So, now we are up to 67%, or two-thirds of the budget. (These statistics are from the Center on Budget and Policy Priorities.) And yet, every time I turn on the radio or television, I am bombarded with commentators telling me that the problem is with government “pork”, welfare, that we need smaller government. And yet, again, those same people are not arguing that we should decrease Social Security, Medicare or defense spending. Just how much do they think government spending can be reduced by cutting the 2% that is spent on scientific and medical research? (Answer: At most, 2%. It wasn’t a trick question.)
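For anyone who wants to check my addition, here is the arithmetic as a few lines of Python. The percentages are the ones quoted above; the category labels and variable names are just mine for illustration.

```python
# Budget shares (percent of federal spending) as quoted in the post.
shares = {
    "Medicare": 14,
    "Social Security": 20,
    "Veterans and federal retirees": 7,
    "Interest on the debt": 6,
    "Defense": 20,
}

# The three categories the post lumps together as spending on older people.
older_people = (
    shares["Medicare"]
    + shares["Social Security"]
    + shares["Veterans and federal retirees"]
)
listed_total = sum(shares.values())   # everything listed above
everything_else = 100 - listed_total  # what is left for all other spending
research = 2                          # scientific and medical research, per the post

print(f"Going to older people: {older_people}%")
print(f"All categories listed above: {listed_total}%")
print(f"Everything else combined: {everything_else}%")
print(f"Most that cutting research could save: {research}%")
```

However you slice it, the categories nobody proposes cutting are about two-thirds of the total.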
As statisticians, we are getting better and better at impressing each other with how smart we are. Maybe we are even getting better at impressing the general public, when they think about us at all.
Many years ago, I decided that my role as a teacher was not to leave the class impressed with how much _I_ know but to leave them knowing and understanding more themselves.
I’m not sure we’ve made progress in that direction.
Apr
25
Life is Full of Disappointments
Filed Under Technology, statistics | 4 Comments
I have been trying to get ready for two workshops this summer. One is called Visual Data with SPSS (pretty obvious what it is about). The second one is statistics using SAS Enterprise Guide. I was going to call the first course Statistics without Numbers and the second one Statistics without Programming. A colleague pointed out that what students want is not statistics without programming but statistics without pain. I never quite see statistics as painful in the same way that some students do, but I conceded his point and so that is now the official name on the schedule.
It will, in fact, be a course without programming because I have spent half the weekend so far beating the data into shape. Unfortunately, this is going to be a bit misleading for students because in real life data don’t come nicely packaged. There are a couple of things that I have not found a way to do in SAS, SPSS or Stata using a point-and-click (GUI) interface.
Chief among these is array processing. Say, for example, I want to recode all 90 questions on a survey to have both -1 and 8 as missing values. The closest you can come is in SPSS, with the TRANSFORM > RECODE menu options, which do remember the previous values you entered for old and new values. Still, it’s much quicker to just write the syntax for it. Same with Stata and SAS. If there’s a way to do it quickly, I have yet to discover it.
One idea I stole from Dreamweaver is snippets, little bits of code you store to do specific little tasks, like create a form button. Probably the most common “snippet” I use in SAS is the array / do-loop:
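/* Recode missing-value codes to SAS system missing (.) */
data in.visualdata ;
set in.visual ;
array redo{*} _numeric_ ;    /* every numeric variable read from in.visual */
array nxt{*} q1 -- q920a ;   /* double hyphen: all variables positioned from q1 through q920a */
Do i = 1 to dim(redo) ;
if redo{i} = -1 then redo{i} = . ;   /* -1 means missing everywhere */
end ;
Do j = 1 to dim(nxt) ;
if nxt{j} = 8 then nxt{j} = . ;      /* 8 means no opinion / don't know on these items only */
end ;
drop i j ;   /* keep the loop counters out of the output data set */
run ;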
Above, the data used -1 for missing data for all of the numeric variables, so it was easy enough to take care of that. However, for some questions, 8 was coded “no opinion/ don’t know” so I wanted that to be missing also, but for other questions 8 was a valid value. So, I needed two array statements and two do-loops.
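The same two-pass idea is easy to sketch outside of SAS. Below is a rough Python version, not one of my actual snippets: the function name recode_missing, its parameters and the sample record are all hypothetical, and None stands in for the SAS missing value.

```python
def recode_missing(record, all_numeric_code=-1, item_code=8, item_columns=()):
    """Return a copy of record with missing-value codes replaced by None."""
    cleaned = dict(record)
    # First pass: one code means missing in every numeric column.
    for name, value in cleaned.items():
        if isinstance(value, (int, float)) and value == all_numeric_code:
            cleaned[name] = None
    # Second pass: another code means missing only in the listed columns.
    for name in item_columns:
        if cleaned.get(name) == item_code:
            cleaned[name] = None
    return cleaned

# Hypothetical record: q1 and q2 use 8 for "don't know"; hours can validly be 8.
record = {"id": "A17", "q1": 8, "q2": -1, "hours": 8}
cleaned = recode_missing(record, item_columns=("q1", "q2"))
print(cleaned)
```

The point of the second list of columns is the same as the second array: 8 is only wiped out where it really means “don’t know”, and it survives wherever it is a valid value.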
I did not see any way to do this without programming other than 90+ pointy-clicky things. Not.
I have similar “snippets” that do the exact same thing for Stata and SPSS.
Another disappointment in Enterprise Guide in particular is the lack of a convenient where clause. I would like to only analyze cases where the respondent selected Obama or McCain as the likely candidate in the election. I could easily use the QUERY feature in SAS EG, create a computed column, recode into a new column called vote2008 and now have three values: missing, Obama and McCain. However, if I wanted results only on those who had selected Obama or McCain, I would have to use the Filter & Sort feature and create a new dataset. I thought perhaps there was a WHERE clause and I had missed it.
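What I mean by a where clause is a filter applied at analysis time, with the original data left alone and no second dataset lying around. Here is a minimal sketch of the idea in Python rather than SAS. The helper name where, the income column and the four respondents are made up; only vote2008 is the column described above.

```python
from statistics import mean

def where(rows, predicate):
    """Yield only the rows that satisfy predicate; the source rows are left untouched."""
    return (row for row in rows if predicate(row))

# Made-up respondents; vote2008 is the recoded column named in the post,
# income is a hypothetical 1-8 income category.
respondents = [
    {"vote2008": "Obama", "income": 4},
    {"vote2008": "McCain", "income": 6},
    {"vote2008": None, "income": 2},
    {"vote2008": "Obama", "income": 6},
]

mean_income = {}
for candidate in ("Obama", "McCain"):
    subset = where(respondents, lambda r, c=candidate: r["vote2008"] == c)
    mean_income[candidate] = mean(r["income"] for r in subset)

print(mean_income)
```

The respondent with a missing vote never enters either subset, and the respondents list is exactly as it was when we started.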
So, I googled “SAS Enterprise Guide” WHERE clause and was linked to a post that ironically mentioned my blog saying that I can’t see a lot of experienced programmers switching to Enterprise Guide. [In an aside here, I should mention I did not get as much hate mail as from the R people, just some snippy comments from the SAS EG folks about how I am "old". Having survived the adolescence of three daughters and a fourth now on the brink I have developed immunity to all such comments . Moo ha ha <---- Evil scientist laugh, in case you didn't recognize it.]

Eva, Supergenius Baby
Besides, when you’re old, you get to have grandchildren as compensation. So, it’s all good.
In the comments on the blog that commented on my blog (are you lost yet?) was a discussion of the disappointing absence of the WHERE clause.
I overcame this disappointment quickly because SPSS actually does have something like what I wanted. Go to the DATA menu and choose SELECT CASES, throw in an IF clause and you have the dataset you want to analyze. Then, when you go to the next analysis, you can select different cases if your little heart desires. In another, fleeting, disappointment, SAS Enterprise Guide does not seem to have an option to export to SPSS (or Stata, for that matter), although SAS 9.2 does export to both SPSS and Stata (and about damn time, too). No big deal. I exported it to Excel which will pop open in SPSS no problem.
In yet another disappointment (is the title of this post not “Life is Full of Disappointments”?) I could not find a way to make SAS EG do the graph I wanted, which was the mean income of people who voted for McCain and people who voted for Obama. The bar chart options kept giving me percentage, cumulative percentage, frequency and cumulative frequency as the only options. Yes, I KNOW I could code it in PROC GCHART but have you ever actually written anything in SAS/Graph? Yuck! It reminds me of when I used to have to write things using Tell-A-Graf to produce plots on our plotters at General Dynamics. (And if you remember any of that, you really ARE old!)
Of course, the course IS entitled Visual Data with SPSS and I was only cleaning up the dataset in SAS EG because it happened to be open.
In the final disappointment, one that has been going on for a while, actually, I haven’t been able to read in the formats with a .stc file from ICPSR. I contacted them and they suggested running with options nofmterr. This is one of those pieces of advice like yelling “Run faster!” to a runner in a race. It is correct but not really very helpful. My problem is that I wanted to have the formats created so I could use them. Usually ICPSR provides you code in SAS, SPSS or Stata with the formats/data labels. Not this time. Oh well, that is something my helpful young assistant can do on Monday. Thankfully I will only be using 16 of the bazillion variables.
Anyway, I am over all of it. Tomorrow, after judo practice, I am going to the Renaissance Faire for Mother’s Day. We are going tomorrow because, for the umpteenth year in a row I will be out of town on Mother’s Day. This time, though, it is NOT for work but to watch my next-to-youngest baby compete in Tunisia. Some people, like those that watched her winning this final at the Valentine’s Day Massacre, say she is not such a baby, but she still is to me.
So, yeah, my software isn’t perfect but the weather is lovely and my kids are pretty good. As for my husband, he just brought me up a glass of Chardonnay and it is time to kick back, drink it and read the New York Times (yes, even though I live in LA, I read both papers every day).
Just read a tweet from some young starlet saying,
“You don’t marry someone because you can live with them, but because you can’t live without them.”
And my thought was,
“Honey, you are obviously single. (And put some more clothes on, too.)”
Yes. I AM old.
So, despite the disappointments, I guess I will survive to teach the summer workshops. Who knows, I may even get time to go to the beach.

Apr
21
When Data is Not Art
Filed Under statistics | 2 Comments
I failed art in junior high school. When I tell people that, people who actually have artistic talent, they look at me in disbelief and say,
“No one fails art. That’s one of the great things about art. How could you possibly fail art?”
The answer is that I was very, very bad at it. Part of this might have to do with the fact that I am extremely near-sighted and was constantly losing my glasses and then going without for weeks or months until somehow my mom found the money to buy me yet another pair. The other part, to be truthful, is that I was just very, very bad at it.
Narratives 2.0 has awesome pictures of music tracks, which maybe mean something if you are a musician. Then again, maybe not.
FlowingData is more what I am talking about in terms of data visualization. While some of the graphics are just plain funny (the one on love, for example), the message of this map, on mortality rate under five, should be obvious to almost anyone.
Often, when I am looking at data, it is something far less artistic. I’ve done a lot of program evaluations over the years, sometimes of programs that were not exactly above board. After all, the staff members reason, I’m flying in from thousands of miles away. They’ll just enter some names and test scores in their database. How will I possibly know?
Here is an SPSS dataset that happened to be lying around. It has the actual data from a project that was supposed to be providing staff training. There was an experimental group, which received the training, and a control group that did not. The first thing I do is select out the control group and plot the pre-test by the post-test. If this is a reliable test, there should be a high correlation between pre- and post-test for the control group, fitting pretty close to a straight line.

The next thing I do is SELECT CASES (found under the DATA menu) for the experimental group. If the training was effective at all, there should be a correlation between the pretest and post-test for the experimental group, but there should be more scatter around the line. Why? Because some people benefit more from training than others. Some come late, leave early and fall asleep in between. Others pay rapt attention and read more about the topic on-line when they get home. Some people with really high scores may have known all of the information in the training and not gained a point. Other people with average pre-test scores may have learned a lot and moved up to a higher score. People who had a very high pre-test score should still have a high post-test score. Hopefully, your training didn’t make them dumber. (Although I think I have attended a training session or two that felt like that.)
So, this is the pattern I am looking for – more scatter in the experimental group, a tighter fit in the control group, and people with high pre-test scores more likely to stay high than people with low or moderate scores are to stay where they started.
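If you would rather have a number to go with the scatterplots, the same check can be sketched as a pre/post correlation computed separately for each group. This is a Python illustration, not the SPSS steps described above, and the scores are made up rather than taken from the project.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up (pre-test, post-test) pairs for illustration only.
control = [(40, 41), (55, 54), (60, 62), (75, 74), (90, 91)]
experimental = [(40, 70), (55, 60), (60, 85), (75, 80), (90, 92)]

r_control = pearson_r(*zip(*control))
r_experimental = pearson_r(*zip(*experimental))
print(f"control r = {r_control:.2f}, experimental r = {r_experimental:.2f}")
```

With data that behave the way I expect, the control group correlation comes out close to 1 and the experimental group correlation is positive but noticeably lower. A correlation alone will not show you who moved, so I still look at the plots.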

If this is NOT the pattern observed, then you and I are going to have a little chat and try to explore these data further. Personally, most of the time I have found more confusion than corruption. For example, once I was looking at graphs like those above but the relationship for the control group was not quite as strong as I expected and for the experimental group there was more of a relationship than I expected. I said,
“It looks as if possibly someone entered people as being in the experimental group who were actually in the control group and vice versa.”
A couple of the staff members looked guiltily at one another and then one spoke up,
“You know, I never really understood which was which.”
So… with only a minimal amount of sighing and eye-rolling, and a significant addition to our eventual bill, one of our young staff members checked all of the files, sorted them into the correct piles, corrected the database and we re-did the analyses.
A second situation in which, more than once, we have seen a different pattern than expected is when the amount of intervention varies greatly. I may see a clump of people who seem more like the control group. Their scores are pretty much the same as when they started. When you and I have our discussion about your program and I ask about those people it turns out that they were in the group that received therapy, after-school tutoring or whatever but they only came to one or two sessions and then dropped out. On the other hand, those people who actually did come to all 15 or 25 or 40 sessions showed significant improvement. When we find these patterns, we split the clients the project served into two groups – and it is usually easy to see a naturally occurring break – and analyze those who came to X number of sessions or more versus the control group. We also take a look at the people who dropped out of the program to see what information we can provide on the people who are not being reached.
We often expect that the more of a treatment an individual gets – therapy, training, tutoring – the better he or she will do. We don’t consider very often a second factor in there. The more of an intervention a person gets the more the therapist has given and, presumably, the better he or she gets at it. This is going to be especially true with a new program. Of course, if you have been tutoring for 20 years, an extra two years of experience isn’t going to make near as much difference as if you have six months of experience. Starting off my career as an industrial engineer back in the early 1980s (yes, they did have engineers back then), I was more familiar with learning curves than I wanted to be and it surprises me that we don’t think of these in social science very often.
Let’s take a look at this with our same training data. We have training delivered for four groups over a two-year period.

For simplicity, I have included only the experimental group above. The first group that was trained, the green line, shows the least improvement, the middle two groups, which were trained in the middle of the project showed more improvement and by far the most improvement was shown by the fourth group (the purple line), trained when the staff had nearly two years of experience on this project.
[For those who want to know, yes, I did do a repeated measures Analysis of Variance with time (pre-test/post-test) as the repeated factor and group (experimental versus control) and training cohort as the between-subjects factors. Yes, I did test for a three-way interaction and yes, it was statistically significant at p < .001. Yes, there was also a significant interaction of time by group, with the experimental group improving significantly more than the control group, also at p < .001.]
Apr
19
Being a professor can build humility. About twenty years ago, I was teaching the third course in the statistics sequence required of all graduate students. The second course had been taught by an adjunct professor, which was FAR less common then than it is now (that’s a whole different post). The first day I started out talking about multivariate statistics (that being the name of the course) and was almost immediately stopped by a student,
“What is that F-statistic you mentioned? We didn’t learn that in the last course!”
Another student interjected,
“Mean square error? We didn’t learn anything about that!”
And so it went. I told the students this material should have been covered in the previous course, but since it obviously had not been, we would back up and start with Analysis of Variance and other topics they should have learned in the previous course. At the end of the lecture, one young woman stayed behind. She said,
“I just had to say this in defense of Professor X. They may not have learned all of those things last quarter, but I was in that class and he sure the heck TAUGHT all of those things!”
The truth of the matter is that the great majority of our students remember very little of the information we teach them, no matter what we teach or who we are. Try this exercise with some friends who have been out of college a few years or more. Ask them to name all the courses that they took. If they are like most people, they can’t even remember the NAMES of every course much less what they learned in them. Even better, pick a random course, outside of that person’s major, and ask what was taught in it. If you are lucky, they can tell you one, maybe two facts.
Even more humbling, after several years as a professor I moved from academia to what my mother refers to as “a real job” in the corporate world. When I was doing what everyone does to move up in rank, get tenure and generally prove their worth as a human being in the university world, i.e., publish articles in refereed journals, my colleagues and I were convinced that this would work its way down to those in practice in our fields, be it business, education or whatever. When I mention this to friends who are in business, they laugh in my face (strangers feel they need to be polite to you). I have to admit that in over twenty years running a business, the number of business articles I have read that helped my business has been startlingly small.
I have given hundreds of lectures on probability, p-values, power, the normal distribution, multiple regression, standardized beta weights, etc. The vast majority of people in those lectures were graduate students in social work, public policy, education, psychology, history, speech pathology and other majors but definitely NOT mathematics and statistics. In a development of which I disapprove, many graduate students can now get a masters or even a Ph.D. with only one course in statistics and research methods. Incredibly, my disapproval has NOT caused universities throughout the country to reverse this trend. Yes, I can hardly believe it either.
These graduate students, many of them long since graduated, are now making policy, managing programs, running companies and awarding grants. Many of them are very intelligent, logical and extremely knowledgeable about their content area, whether it be speech disorders or social security. What they are not is statisticians. They need to be able to make sense of statistical data and they need to be able to do it better than most of them can now.
Ironically, given the amount of time we spend on probability and hypothesis testing in most statistics courses, many of their decisions do not hinge on generalization to a population. It reminds me of a story someone told me recently about a presentation to a local school district. The speaker discussed the differences between schools in low income and higher income areas, correlations between various school factors and income, and talked at length about p-values and generalization to the population. The superintendent asked,
“What are you talking about? You have all of the records of all of the students in our district – or over 98% of them, anyway – this IS the population. I’m not interested in the rest of the U.S. or the world. You HAVE the population.”
In one or two semesters, I cannot make a person a statistician. I can’t even convince them to regularly read articles in refereed journals, and, if I am honest about it, most of those articles were written just so people could get tenure and really aren’t that useful anyway. What I hope I can do is enable him or her to take data and tell a useful and correct story. Here is a very simple example from a real project. The BP Project (not its real name and not affiliated with British Petroleum) received $1.5 million to recruit and educate teachers for bilingual students. The first thing the Project Director wanted was a picture of the type of students being recruited for the project. At the time when this snapshot of the data was taken they had admitted 408 teacher candidates over a five-year period.

The first thing we can see is that the students are overwhelmingly female. That isn’t surprising since most teachers are female. Although increasing the number of males in teaching isn’t one of the project goals, the director is concerned about these results. She feels that many of the students in the schools where her graduates teach don’t have enough male role models and a discussion ensues with her staff about whether they should be trying to recruit more male students.
The university also has both a joint credential – masters degree program that students may elect and a regular teaching credential option. In looking at the proportion of students who choose the Masters in Education option we can see that it is a minority, less than 15%, who select the masters program. Although this is more than the 10% at the university as a whole, the director feels that her students could do better and discusses with her staff options for increasing the number of students who make this choice.

The next question she wants to ask is why some students select the masters program. Are those who choose the masters program better academically, as measured by their GPA? Are the ones who did not choose the masters program academically sub-par? As we can see from this chart, there is a difference between the two groups. In general, when I tell clients,
“The F-statistic for Levene’s test of equality of variances was significant at p < .01. Therefore, you have a t-statistic of 7.3 with 135 degrees of freedom with a p-value of less than .001."
they do not find it very helpful, even though I personally think that is very useful information. (Sadly, I have found that people are willing to pay for what THEY personally find useful and aren’t that interested in whether or not it is crystal clear to me!)

So, I produced this graph using SPSS, which shows that the average student who is in the regular credential program has a GPA of 3.11, above the 3.0 minimum for admission to the graduate program. The director found this very interesting. She wanted to know why, when the average student who was getting a credential could qualify for the masters program, more did not choose this route. I have no idea. She decided to meet with her students and former students and ask them. On the other hand, it is clear that the masters-credential students do have a higher GPA – nearly 3.5 versus 3.1 – and that this is probably significant in the practical as well as the statistical sense.
Given that this is a program designed to serve disadvantaged students, there have been mutterings from other faculty members that these students do not meet the university standards. This irritates the director for many reasons, not the least of which is that the program is for teachers to serve disadvantaged students, not necessarily teacher candidates who are disadvantaged themselves. She wants to look at the overall grade distribution.

This graph tells her several things. First, it tells her there are probably a few people who had data entered incorrectly because it is very unlikely anyone was admitted with a 1.0 GPA. It looks like maybe 1 to 1.5% of the data have errors. I recommend she check that out. Second, the average student has a GPA of 3.16, substantially above the cut-off of 2.50 for the credential program and even above the 3.0 cut-off for the masters program. Further, the GPA distribution is very skewed, which is a good thing, in this case, and expected for a selective program. The overwhelming majority of the students exceed the minimum GPA cut-off.
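The same reading of the histogram can be turned into a quick check that lists the suspicious records instead of just estimating how many there are. This is a rough Python sketch, not what I ran in SPSS. The 2.50 and 3.0 cut-offs are the ones mentioned above; the plausibility floor of 2.0 and the eight GPAs are assumptions made up for the example.

```python
def gpa_checks(gpas, floor=2.0, credential_cutoff=2.5, masters_cutoff=3.0):
    """Summarize suspect values and the share of students meeting each cut-off.

    floor is a hypothetical plausibility threshold: anything below it is
    flagged for a look at the original paperwork, not deleted.
    """
    n = len(gpas)
    suspect = [g for g in gpas if g < floor]
    return {
        "n": n,
        "suspect": suspect,
        "pct_suspect": 100 * len(suspect) / n,
        "pct_meeting_credential": 100 * sum(g >= credential_cutoff for g in gpas) / n,
        "pct_meeting_masters": 100 * sum(g >= masters_cutoff for g in gpas) / n,
    }

# Made-up GPAs for illustration only; not the project data.
gpas = [1.0, 2.6, 2.9, 3.1, 3.2, 3.4, 3.6, 3.8]
report = gpa_checks(gpas)
print(report)
```

Flagging rather than dropping matters here, because the fix is to go back to the files and correct the entry, not to pretend the student was never admitted.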
The final question the director had was about ethnic distribution of her teacher candidates. The university is predominantly white, non-Hispanic but she wondered whether a program designed to prepare teachers of bilingual students might attract more Latino and Asian-American students. This chart produced an interesting picture and one that suggested something wrong with the data.

The most common category is “did not specify” and the second is “other”. Looking into data collection problems, it was found that, in the initial year of the project, ethnicity was not asked on the student information form. So, to the extent that the first year ethnic distribution may have been different from later years, these data are biased. Was ethnic distribution different the first year? We don’t know.
Even in years the question was asked, many elected not to answer. A lot of hypotheses were offered as to why this was the case. Possibly non-Hispanic students felt they would have less chance of being admitted to the program and did not answer this question. When the staff followed up with some alumni they were told that, in fact, students did expect to experience “reverse discrimination” in the program and were pleasantly surprised this did not occur. Why did so many students list “other” as ethnicity? Some were of mixed ethnicity, e.g., African-American and White, and did not feel either category fit. Some were Native American or Filipino. For others, we have no idea why they put down “other”. Given the questionable validity of the data, the director was cautioned against using these results to draw any kind of conclusions about the ethnic makeup of the program participants.
One course I do remember from graduate school over twenty years ago was Questioning and Teaching, taught by J. T. Dillon, the author of a book by the same name. I haven’t seen him since and I doubt he’d know me if he tripped over me. I do remember a question he asked though, and my answer. He wanted to know:
“How do we know if we have taught someone something?”
I said,
“I think I have taught a person if I have given them the answer to a question they have been wondering about. If it isn’t THEIR question, they probably won’t pay any attention to it and I’m sure they won’t remember. If I don’t answer their questions, I’m just a person who stands in front of a room and talks a lot.”
He repeated,
“A person who stands in front of a room and talks a lot. Young lady, do you realize you’ve just given the description of most of the teaching that occurs in this country?”
I didn’t have an answer to that.

