Of course I am most grateful for my family. As daughter number two, a.k.a. “The Perfect Jennifer”, commented yesterday,
“This is the only family I know where everyone in the family actually talks to one another.”
It’s true we don’t have any made-for-TV movie problems. No one is in rehab, no divorces, incarcerations, homelessness, domestic violence. The depth of dysfunction in our household is the occasional excess whining.
On the other hand, after reviving from a food-induced coma the various daughters had other plans. Jennifer was heading out to a club to catch up with some friends. I was surprised that any place would be open on Thanksgiving Day, but Jenn pointed out that plenty of people don’t have families, and other people have families that drive them to drink.
So, I went back to some notes I was writing on matrices and realized that I am also extremely thankful for people who take the time and effort to make their knowledge freely available on the web. I was extremely skeptical of the announcement this week that President Obama is supporting STEM (science, technology, engineering & mathematics) education. I picked this link out of the 500+ on the web because it included the interesting comment that most parents would rather talk to their children about drugs than mathematics and science.
While I wish the best of luck to President Obama and all of his corporate cheerleaders, I think the preceding statement is one half of the reason I suspect nothing will come of this. The other half is that the vast majority of teachers I have met don’t want to teach STEM and don’t want to learn it.
This makes me doubly thankful for those who are good teachers and generous enough to share themselves. Let’s take a simple tour of some nice websites with a topic, say, matrices.
Start with onlinemathlearning.com – the videos are excellent for a student who already has some interest in math and perhaps a basic understanding. There are no animated leaping leopards from rain forests here. You know, I am not sure those help. At worst, they give students the message that math in itself is not inherently interesting enough to learn. The onlinemathlearning site gives this explanation of a singular matrix:
“If the determinant of a matrix is 0 then the matrix has no inverse. It is called a singular matrix.”
This is followed by a very understandable video which shows that to invert a matrix one needs to multiply by 1 divided by the determinant. If the determinant is zero, it can’t be done. For those who did not know what a determinant is, that is explained in an earlier page. (Really, you should take a look at the video. It is quite a nice explanation.)
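To make the "multiply by 1 divided by the determinant" idea concrete, here is a minimal sketch in Python for the 2 x 2 case. It is not taken from the video or the site; the function names and the example matrices are made up for illustration.

```python
from fractions import Fraction


def determinant_2x2(m):
    """Determinant of a 2 x 2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    return a * d - b * c


def inverse_2x2(m):
    """Invert a 2 x 2 matrix by scaling its adjugate by 1 / determinant."""
    det = determinant_2x2(m)
    if det == 0:
        # Dividing by a zero determinant is impossible: the matrix is singular.
        raise ValueError("singular matrix: determinant is 0, so no inverse exists")
    (a, b), (c, d) = m
    scale = 1 / Fraction(det)  # exact arithmetic, no rounding noise
    return [[d * scale, -b * scale], [-c * scale, a * scale]]


# Hypothetical example matrices, chosen only for illustration.
print(inverse_2x2([[4, 7], [2, 6]]))

try:
    inverse_2x2([[1, 2], [2, 4]])  # second row is twice the first
except ValueError as err:
    print(err)
```

The second matrix has a determinant of 1*4 - 2*2 = 0, so the function refuses to invert it, which is exactly the singular case in the quote above.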
Somewhat surprisingly, given all the dissing it gets in academic quarters, wikipedia has some great math and statistics articles. For example, this one on positive definite matrices gives the following definition,
followed by some equally understandable examples. [Note: For those of you who are shaking your heads and saying, ‘THAT’S understandable?’, trust me that I left out a lot of websites that seemed to be written with the attitude that if you didn’t already understand everything about mathematics it was your own damn fault and too bad.
A symmetric matrix, by the way, is not, as you might suppose, simply one that has the same number of rows and columns. No, rather it is a particular KIND of square matrix where the matrix equals its transpose.]
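For anyone who would rather poke at these ideas than read definitions, here is a rough sketch in Python. It checks symmetry directly (the matrix equals its transpose) and tests positive definiteness by attempting a Cholesky factorization, which is one standard way to do it and succeeds only for positive definite matrices. The function names and example matrices are invented for illustration, and the exact comparisons assume small, clean inputs; real data would need a tolerance.

```python
import math


def is_symmetric(m):
    """True if m is square and equal to its transpose."""
    n = len(m)
    if any(len(row) != n for row in m):
        return False
    return all(m[i][j] == m[j][i] for i in range(n) for j in range(i))


def is_positive_definite(m):
    """Attempt a Cholesky factorization; it succeeds only if every pivot is > 0."""
    if not is_symmetric(m):
        return False
    n = len(m)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = m[i][i] - s
                if pivot <= 0:
                    return False  # zero or negative pivot: not positive definite
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (m[i][j] - s) / lower[j][j]
    return True


# Hypothetical matrices for illustration only.
print(is_symmetric([[1, 2], [3, 4]]))          # square, but not symmetric
print(is_positive_definite([[2, 1], [1, 2]]))  # symmetric and positive definite
# Looks like a correlation matrix, but the three correlations contradict each other.
print(is_positive_definite([[1, 0.9, -0.9], [0.9, 1, 0.9], [-0.9, 0.9, 1]]))
```

The last matrix is the interesting one: every entry is a legal correlation on its own, yet no real set of three variables could produce all three at once, so the check fails.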
If you would like to know a little bit more about positive definite matrices than you get from wikipedia, you can check out this page on “Not positive definite matrices, causes and cures”. This is a link from Ed Rigdon’s SEM FAQ page. (It might be said that frequently asked questions about structural equation modeling is an oxymoron, unless the question is ‘What the hell is structural equation modeling?’)
Now, it no doubt represents a failure in my education that I do not know who Ed Rigdon is. On the other hand, I’m pretty sure he doesn’t know who I am either. Regardless of our mutual non-acquaintance, he gets major kudos from me for the following statement (emphasis added) wherein he touches on one of the major flaws in much use of statistical software today:
“Now, some programs include the option of proceeding with analysis even if the input matrix is not positive definite–with Amos, for example, this is done by invoking the $nonpositive command–but it is unwise to proceed without an understanding of the reason why the matrix is not positive definite. … Sample covariance matrices are supposed to be positive definite. For that matter, so should Pearson and polychoric correlation matrices. That is because the population matrices they are supposedly approximating *are* positive definite, except under certain conditions. So the failure of a matrix to be positive definite may indicate a problem with the input matrix.”
Just because you can do something doesn’t mean you should. LISREL quite sensibly quits under the circumstance when the covariance matrix is not positive definite, and issues you a message to that effect, at which point you should feel shame.
My point, which I do have buried in here, is that STEM education is not about “making science cool”, it is about understanding stuff.
Here is what I am going to do right now for STEM education. I had a talk with my 11-year-old last week about possible questions that could be answered because we can tap into the high performance computing cluster from home and there are all sorts of enormous datasets, including census data. I suggested perhaps her class would like to come up with some questions. Julia made an A in math and did okay on her standardized tests (defined as not nearly as above average as I consider acceptable) and her lowest score was in ‘data interpretation’. Since I haven’t heard back from her teacher, Julia and I are going to hypothesize about such things as the number of 11-year-olds in the country and how many of them live in different regions, the average income, the standard deviation of income, and where she stands relative to that. Then, I am going to write a program to find all of the answers and run it on SAS 9.2, which we are still testing (no bugs so far). Since I am still testing it and haven’t used the map library, this will be a nice thing for the university, as I am working for free on Thanksgiving weekend, and Julia’s knowledge of data interpretation and hypothesis testing will improve.
Whether she thinks it is cool or not.
Maybe I have been wrong.
It wouldn’t be the first time. In fact, most of the really great things in my life have come about when I realized I was on the wrong track and took a sharp right turn. (Uncharacteristically skipping the opportunity here to make a snarky comment about my first boyfriend, job or marriage.)
Four incidents made me reconsider what matters in statistics.
The first two were related. I attended a really fascinating lecture on statistics where the speakers discussed pretty much what morons 99.99% of the world were, how assumptions are violated, how variables are not normally distributed and how, as a result, our standard errors are usually wrong, not just by a little bit but by a lot, and how we should all be learning new statistical methods. Honestly, I found myself nodding my head in agreement at most of what they said, and I have to admit that I have been guilty of some of these same errors myself.
Shortly thereafter, I was at the SPSS Directions conference and I asked the equally fascinating Bill from SPSS about the research he was doing to predict shipments of contraband, violent incidents in Richmond and cyberattacks. He said much of what they were using was Decision Tree.
It dawned on me that both of these brilliant men were right. On the one hand, making the wrong assumptions can inflate your standard errors by a lot. On a practical level, with 100 or even 300 subjects this can make a substantive difference. However, if you have 12,000 or 12,000,000 records then yes, your standard error may actually be three times what you incorrectly have assumed, but if that means it is .00027 instead of .00009 – it isn’t going to make much difference. This harks back to that article in Wired a few years ago, on The End of Theory. The gist of it was that with the amount of data we have, who needs science? At the time, I spotted some immediate flaws in the argument, and I have spotted similar ones since. If you don’t look even at some basic regression assumptions you may miss, for example, that your prediction is great overall but for your highest dollar customers, most violent offenders, or whatever your dependent variable, your error rate is much higher, and these are the exact people you most want to predict (i.e., you have violated the assumption of homoscedasticity). I have to say, too, that the analogy with language translation was less than compelling. The article said that Google can translate from one language to the next without knowing the language. Well, not so well. I just typed a phrase in Spanish into Babelfish that I had said to my little daughter lying in the next bed in the hotel. It means, “You are such a beautiful little girl”. The translation came back, “That pretty girl.”
… and yet … it was close, it didn’t come back with “Hey, buddy, you want to buy a goat?”
It wouldn’t take too much effort, really, to take the 1,000 or 10,000 most common phrases in each language and enter those into the software and have these checked and THEN go on to the word by word translation.
The same with statistics. You can build in the diagnostics, as SAS has done with ODS GRAPHICS, for example.
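As a rough idea of what a built-in diagnostic might look like, here is a small Python sketch. It has nothing to do with how SAS implements ODS GRAPHICS; the function names, the group labels, the threshold of 2.0 and the data are all hypothetical. It does two things discussed above: it shows how the standard error of a mean shrinks with sample size, and it compares prediction error across groups to catch the case where the overall fit looks fine but the error for the most important cases is much larger.

```python
import math


def standard_error(sd, n):
    """Standard error of a mean: standard deviation over the square root of n."""
    return sd / math.sqrt(n)


def residual_spread_by_group(records):
    """Root mean squared prediction error per group.

    records is an iterable of (group, actual, predicted) tuples.
    """
    residuals = {}
    for group, actual, predicted in records:
        residuals.setdefault(group, []).append(actual - predicted)
    return {
        group: math.sqrt(sum(r * r for r in rs) / len(rs))
        for group, rs in residuals.items()
    }


def flag_uneven_spread(spread, ratio=2.0):
    """Flag when the worst group's error exceeds `ratio` times the best group's."""
    smallest = min(spread.values())
    largest = max(spread.values())
    if largest == 0:
        return False  # perfect predictions everywhere
    if smallest == 0:
        return True
    return largest / smallest > ratio


# Even a tripled standard error is tiny with millions of records.
for n in (100, 12_000, 12_000_000):
    print(n, standard_error(1.0, n), 3 * standard_error(1.0, n))

# Made-up records: (group, actual, predicted).
records = [
    ("typical", 100, 102),
    ("typical", 110, 108),
    ("top", 1000, 700),
    ("top", 1200, 1500),
]
spread = residual_spread_by_group(records)
print(spread, flag_uneven_spread(spread))
```

The point of wiring a check like this into the software is that nobody has to remember to ask for it.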
I think some knowledge of statistics is crucial, but I am not so convinced that minor departures from normality, small correlations among variables or some heteroscedasticity will damn us all to statistical hell when our datasets are approaching hundreds of gigabytes on a fairly regular basis. Yes, it will not be precisely correct. Yes, there IS danger to not understanding some of the basic underpinnings of statistics. The Chronicle of Higher Education forum has anonymous (more or less) users but some day I do hope to meet in person the person whose signature says,
“Being able to find SPSS in the start menu does not qualify you to run a multinomial logistic regression.”
Not only funny, but true. On the other hand, I don’t think you need to be able to calculate structural equations in your head to be qualified to design, conduct and interpret a statistical study.
The third incident included the same Bill-the-I-think-he-was-a-vice-president from SPSS. (A person better than me at sucking up would have found out his last name. I did make a faint effort on Google, yet again proving less than infallible as Bill appears to be a popular name for SPSS vice-presidents and I am writing this at 1:30 a.m. on the east coast as I have not yet adjusted to the time zone so am somewhat imprecise.) I ran into him after his presentation and asked him if he had published his results, as the improvements in prediction they had achieved were really quite remarkable. He said,
“Not in the sense that someone like you from a university means by publishing.”
He went on to say that he had presented at conferences like this one and discussed the results with customers but not published in an academic journal.
Finally, there was the SAS Tech Report, in which editor Waynette Tubbs asked, in a piece about finding a job and networking, “Are you published?” but she meant having a blog, doing papers at places like WUSS (Western Users of SAS Software) and SAS Global Forum. This is very far from the definition of publishing that I was taught (nearly brainwashed with), as almost all doctoral students at research institutions are: that peer-reviewed, academic journals were the gold standard, be-all, end-all and 90% of the measure of your worth as a human being.
So, is it just barely possible that having a very, very good understanding of statistics, albeit not to the point of tossing off pages of proofs off the top of your head, and writing about it in a comprehensible fashion is what really matters? To a regular person, this probably doesn’t sound too crazy, but to someone who has spent most of a lifetime in academia it borders on heresy.
I think that knock on the door is a group of inquisitors come to burn me at the stake.
On the other hand, it may just be room service.
Everybody is talking now about the acquisition of SPSS by IBM. So far, having just come back from the SPSS Directions conference, I can’t say that I have seen a lot of differences yet. There seemed to be a few more SPSS people at the conference, which is a good thing. In my experience, the sessions at the Directions conference are uneven. Some are really informative – the one on how the Hamilton County, TN schools were using SPSS was great. Basically, everyone from the district to individual teacher level gets regular reports on student progress, down to the detail of whether the student performed worse on word problems or computation on math achievement and up to analyses showing the significant impact of being old-for-grade in elementary school on high school drop out years later. Other sessions were as boring as watching paint dry and it would have been more productive to have sat in the hallway and checked my email. HOWEVER, any chance you have to meet with the folks from SPSS is well-spent. I am totally bummed that I did not have the time to take Jon Peck up on his offer to demonstrate some of the extensions.
WAY COOL #1: If you go to Directions don’t miss the chance to sign up for the product experience sessions. You can schedule in advance a meeting with an SPSS expert on your topic of interest. I spent a half-hour with Nancy Morrison learning about the uses of the Missing Values analysis add-on, including pattern analysis and multiple imputation, and the Complex Samples add-on. To use Complex Samples you need to create a plan file, but this is relatively easy to do. Complex Samples allows you to estimate frequencies, means, regression equations and more with stratified samples, cluster samples, etc. and get the appropriate standard errors.
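To give a feel for what "appropriate standard errors" means with a stratified sample, here is a small Python sketch of the textbook stratified estimate of a mean. This is not the SPSS Complex Samples algorithm and does not use a plan file; the function name and the data are hypothetical, and the sketch assumes simple random sampling within each stratum and ignores the finite population correction.

```python
import math
import statistics


def stratified_mean(strata):
    """Estimate a population mean and its standard error from a stratified sample.

    strata is a list of (population_size, sample_values) pairs, one per stratum.
    Each stratum needs at least two sampled values so a variance can be computed.
    """
    total = sum(size for size, _ in strata)
    mean = 0.0
    variance = 0.0
    for size, values in strata:
        weight = size / total  # the stratum's share of the population
        mean += weight * statistics.mean(values)
        variance += weight ** 2 * statistics.variance(values) / len(values)
    return mean, math.sqrt(variance)


# Made-up example: a large stratum with low values and a small one with high
# values, sampled equally (so the small stratum is heavily over-sampled).
strata = [
    (9000, [10, 12, 11, 13]),
    (1000, [50, 70, 60, 80]),
]
estimate, se = stratified_mean(strata)
print(estimate, se)

# Ignoring the design and averaging all eight values gives a very different answer.
print(statistics.mean([10, 12, 11, 13, 50, 70, 60, 80]))
```

Weighting each stratum by its share of the population is what keeps the over-sampled small stratum from dominating the estimate.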
WAY COOL #2: I learned from Nancy that the algorithms for SPSS are included on the SPSS documentation CD. How did I miss this? I had read several of the books on that CD. If you lost it, like I did (cleanliness is not my best thing, and since I do not have a housekeeper who follows me to work the CD is probably under the tank containing my two frogs, Type I and Type II), you can find the algorithms here ftp://ftp.spss.com/pub/spss/statistics/spss/algorithms/ .
Another good way to learn more about SPSS is the See It in SPSS events. I’ve been to a few of these and they included examples of using SPSS in teaching (one really cool idea – the professor created a survey on Qualtrics and had students send it to five friends and complete it themselves. This yielded a sample of about 300 that the students analyzed. Undergraduates tend to be intensely interested in themselves – even more than most people – and it was a very creative assignment). Other presentations have included integration of R with SPSS and uses of PASW Modeler.
WAY COOL #3: I heard about this at one of the See It in SPSS events and could not believe I had not heard of it before. The Faculty Pack is an amazingly amazing deal. It includes 13 or 14 add-ons (depending on whether you have Mac or Windows). Even if, like me, you already have a few of those add-ons licensed through your university, there are sure to be some like Neural Networks, Complex Samples, Custom Tables, Conjoint, Bootstrapping, Missing Values analysis that you don’t have. If you bought these separately you’d pay over $10,000. The Faculty Pack costs $250 a year.
Okay, in case this is starting to sound like an SPSS ad, I have to mention the just plain stupid part. As I asked several people at Directions – what the hell is this with selling software by the byte? BOOT-STRAPPING is an add-on? COMPLEX SAMPLES are an add-on? Come on! Maybe this makes sense at a commercial organization that might have one type of analysis but any mid-sized or larger university is going to have a School of Business, Medicine, Social Work, and a range of others that use a range of applications. When we license SAS, we get every statistical analysis they sell, from survey methods (e.g., surveylogistic, surveyfreq) to structural equation models (calis, tcalis) to all of the ODS graphics, Tabulate, Enterprise Guide and more. When people ask me about using SPSS for some add-on we don’t license I often recommend they use SAS.
Few things at a large institution happen quickly. To add another piece of software to the labs for teaching students usually requires at least two or three levels of approval, and can take a year or more. Even purchasing it for one person requires over $1,000 for each add-on, at least one level of approval, and an internal requisition or check request. It’s easier to just use SAS or Stata, which come with all the statistical analyses 95% of the people will ever need, all bundled together.
The decision by SPSS to market its software like this is just plain stupid. An article by Michael Mitchell of UCLA on use of statistical software is consistent with what I have always observed, that is, people tend to use what they know, what they learned in graduate school, what their professors used, even when that isn’t always the best tool. One would think, then, that software manufacturers would realize that it behooves them to make it less of a pain in the ass for universities to order and use their software. At our university, we have 33,000 students, about 9,000 faculty and staff, we install the software, provide technical support and classes on how to use it. What the heck more could a company want in the way of ‘evangelists’ for the software? Why can’t SPSS sell its statistics package like everyone else?
Decades ago (yes, I am that old), when I was working on an IBM mainframe, I was very favorably impressed with their customer service. So, hopefully, some of that will spill over on to licensing SPSS.
So, I am loving my new computer – 12 GB RAM, dual quad-core and a terabyte of space. Mac OS 10.6 (Snow Leopard) plus four VMs with Vista 32, Vista 64, Office 32 and Office 64, plus two enormous monitors so I can run Unix programs on our cluster and monitor that while I am programming in the Windows environment or running SPSS on the Mac on the other side. Learning JMP keeps moving up my to-do list, so I got the SAS download manager for JMP, downloaded it to my computer and got this message:
JVM not found.
I suspected immediately it had something to do with Snow Leopard and the fact that I was right about this helped me not at all in fixing it. I could not find any mention of this problem on the JMP site or anything specific to the SAS download manager or JMP anywhere. However, I did find quite a number of explanations of why one might get a JVM not found error and what to do about it.
I found a number of sites with fixes, all relating to installing Java 1.5 on Snow Leopard, some of which were absolute Greek to me (and I have three graduate degrees, including a doctorate with a specialization in statistics where we are used to Greek!)
This one http://chxor.chxo.com/post/183013153/installing-java-1-5-on-snow-leopard from CHXO Internets was absolutely simple to follow and fixed my problem immediately. Hurray.
[Note to self, as soon as I get a spare minute, as in first thing tomorrow, send $20 for shareware fee to Pacifist. I already have another use for this.]
When I followed the few steps outlined in the blog post above, the SAS download manager installed, then I installed JMP and all was wonderful.