# Completely Random

January 21, 2009

I hate SQL. This is probably completely irrational, like that guy I turned down for a date in junior high school who my mom always tells me founded a very successful company and is making piles of money. No wait, it wasn’t irrational, he always tried to copy off me in Algebra, plus he was just plain boring. I think that is my problem with SQL, too, boredom. There is only so much left join, right join, outer, inner and dataset.variable I can tolerate before my brain tries to escape through my right ear just to get away from the monotony. I have met people who love SAS. I have met people who love SPSS. I have even met people who love Stata. Nobody loves SQL. They are just with it for the money.

What practical use is Mokken’s H, really? Yes, it is true that the maximum phi is determined by the marginal distributions, and if you get a phi of .20, for certain distributions, that might be the maximum you can get, but so what? Maybe I was scarred in my youth by reading some of the articles on bias in mental testing where those who were so determined to prove that intelligence was genetic corrected correlations for attenuation, sometimes to as high as 1.20 and then averaged the corrected correlations!

From a purely theoretical standpoint now, it’s completely different. If you are interested in the analysis of binary data – and how could you not be – you’ll like this paper by David Armstrong, at the University of Oxford. I like it because he is very sensible. He doesn’t take a stance like “You should never use phi, never analyze bivariate data in a factor analysis,” etc. He takes a very measured view, which I like because really, so few things in the world are always true, except brain-dead obvious facts like you should not correct correlations to be above 1.0! (Clearly, I have still not gotten over that.) I have several SPSS workshops coming up. I think I will import the data from our evaluation of after-school programs to illustrate just how much the phi and tetrachoric coefficients move around when the marginal distributions change a lot. It’s a tough job, but somebody has to do it.

I can’t see a lot of people who are experienced SAS programmers switching to Enterprise Guide. Who I can see using it is people who use SQL, Access or Excel, or who are just starting to use statistics in their education or profession.

**Hello, my name is Captain Obvious**… All that Data Step stuff you were missing and could not find in SAS Enterprise Guide? It was cleverly hidden in the menu under the word DATA.

*“Must be a new meaning of the word ‘filter’ with which I was previously unfamiliar.”*

Okay, maybe not so obvious is the fact that you need to go under the **Data** menu to **Filter and Query** to add two datasets together. I thought filter meant to hold back certain elements. Oh well, I guess it makes as much sense as going to the start menu to shut down your computer.

So, if you want to compute variables, recode variables, add tables or join tables, go to **Data > Filter and Query**.

Did I mention that I hate SQL?

# SAS was good today and so was the weather

January 12, 2009

It was 85 degrees in SoCal today. Just thought I would rub that in for the benefit of my former and present colleagues in the frozen north. The advantage of having a blog as opposed to doing an on-line course is that I can just randomly switch subjects, which in my office is referred to as “not being corporate”. Don’t know whether it is corporate or not, but I can see that a lot of people are going to be loving SAS Enterprise Guide.

I have to confess that many years ago, when Enterprise Guide first came out, I thought it was one of the stupidest ideas I had ever heard. It was horrendously slow and about as intuitive as building a nuclear reactor out of wood. Have times ever changed! It is still slow, and if you look at the log, the code it writes certainly isn’t what I would do. However, you can open SAS datasets with no problem. People who have no idea what a permanent library or LIBNAME statement is can now use SAS to make graphs, do principal components analysis and analyze subsets of their data. Imagine that Excel and Access had a baby, who then grew up and married the love child of SAS and some really cool graphics program that was not SAS/Graph and didn’t suck. That, in a nutshell, is SAS Enterprise Guide. It is full of surprises, and almost all of them are pleasant. For example, today I used the **Send to** option on the **File** menu to send a file to Microsoft Word and, surprisingly, it opened in Office 2007. Everything I had read said SAS and Office 2007 were not yet compatible, but apparently that is not the case. It has been hard for me to let go of coding everything because it is such a habit after twenty-six years, but I am trying very hard to put myself in the position of the people who will be taking the Enterprise Guide workshop next month and realize that, to them, typing:

*libname in "c:\amsasex\project7\aimee" ;*

*data in.disability_study ;*

*set in.fullsample ;*

*if disability_status = "Y" ;*

is NOT the easy way. So, I have set myself the challenge of trying to use only Enterprise Guide to solve problems and not doing any programming. I have not entirely succeeded yet, by the way, but I am making progress. For example, I am starting to use the **Filter and Query** option from the **Data** menu instead of those subsetting IF statements. It actually works just fine. In another post, I talked about how people continue to use chi-square and ordinary least squares regression even when those are not appropriate at all for their data, because they are familiar. I know I am in the same boat. Several times today, I exited Enterprise Guide, wrote the code in SAS 9.2 and ran it because I want to be able to look at my log and see what it does. Yes, you can look at the log in Enterprise Guide, but the way the code is written is definitely not how I would have done it. In reality, the vast majority of people are very comfortable not knowing what goes on under the hood. How many people who use Word (including me) have the foggiest notion what the code looks like? Enterprise Guide can be a force for good or evil. It can allow researchers and executives more time to focus on how the sample was selected, the selection of the appropriate statistic and correct interpretation of the results. And it can be used by management-weenies and pointy-head boss wanna-bes to print out pretty pictures and tables with lots of numbers that they only pretend to understand.

My prediction, based on a random sample of zero, is that there will be a lot of both.

# The future has been here for a while now

January 7, 2009

*We interrupt this discussion of chi-square and other categorical data analysis for a broadcast from the future …*

My brain is full. After three days of Macworld I have come to a number of conclusions, which I will share with you in random order, thus sparing you the expense, inconvenience and sogginess of traveling to San Francisco.

Any educational institution that is not getting into podcasting BIG TIME is missing the boat. It has already left the shore and you had better start swimming.

The last time I took my daughter to Macworld, high on her Christmas list was the preschool game, Pajama Sam. The iPhone and iTouch were not even a twinkle in Steve Jobs’ eye. My now ten-year-old daughter received an iTouch for Christmas and by the next day had three pages of apps on it, thanks to the allowance in the iTunes store that Daddy gave her. If you believe the Macworld folks, and I have no reason not to, there are 10,000 apps for the iPhone.

I spend most of my time using, reading and teaching about statistics and statistical software. I do some web pages, in both Dreamweaver 8 and HTML; I write one blog using WordPress and a second with Movable Type. Still, most of my focus on computer usage is just that: usage. I use SAS on Windows, Solaris and Linux; I use SPSS, Stata, Dreamweaver and Photoshop on both Mac and Windows. For all of that, I would consider myself a very knowledgeable computer USER and a good SAS programmer. I wouldn’t consider myself an expert on computers any more than I would say I am a tour guide just because I have driven most of the roads in Los Angeles.

What became abundantly clear once I looked up from my statistics books is that the world IS changing. To a greater extent than I had believed, blogs ARE for old people. That isn’t to say that young people don’t read blogs, they do. I can guarantee that younger people are spending a LOWER proportion of time on blogs and surfing the Internet on computers and a HIGHER proportion of their time using their iPods, iPhones, iTouch and other devices. They are also spending a greater proportion of their time PERIOD on-line in some form or another.

Now, none of this is breaking news. At the Center for Scholarly Technology conference several months ago, there were a number of presentations on Digital Natives, and Lance Wicks has been writing about it for quite some time, as have many others. Here is what has NOT happened: by and large, educational institutions have gotten with the program. They have been slow to do so.

Unless Second Life can be efficiently played on an iPod (which I doubt), the Linden dollar is going the way of the Dow Jones.

The improved iLife, with GarageBand and iMovie, is worth the money. I attended a lab on GarageBand on Monday. It was worthwhile. I learned a few new things, which is all one can reasonably expect in two hours.

We have Blackboard, which I, and many other faculty members, use pretty regularly. Yet, how many of us create podcasts? Have we not NOTICED that every one of our students has an iPod? I bet the average student goes to iTunes and YouTube more often than to the library.

Something’s got to give as far as Internet access goes. I know I am going to get flak about this because my own university, in fact, my own department, disables accounts for using excess bandwidth, but we need to acknowledge the fact that people are using the Internet more and more for downloading music, videos and software. Today, I had to call Guest Tech at the Holiday Inn because the Internet quit working. I was told that my IP address had been disabled because of the amount of downloads from the hotel. The helpful young man on the phone commented that often that came about because people were downloading movies and music, and maybe I should not do that.

Actually, I am not running a middle-aged version of Napster from my hotel room, and I doubt my fellow travelers are either. Many of the guests are here for Macworld. Last night, I tried (unsuccessfully) to download a movie illustrating the new features of Mobile Me, by Apple. I also wanted to check some iPhone apps that I had seen reviewed in the new magazine, iPhone Life.

This morning, I am trying to download some podcasts from PBS plus check some iTunes U offerings. It has become painfully obvious to me that we are behind in educational technology everywhere I have ever taught. It is as if the faculty need to sprint to get ahead of the students we are supposed to be leading. I am not saying that as professors we don’t have more knowledge about our field, because we do. We also have very bright, gracious, motivated students in most cases, so they are too polite to point out that we are teaching them Calculus using the modern equivalent of a piece of slate and a sharp rock.

It is not that some of the places I have taught have not used podcasts, it is just that the ones I have seen have generally (not all) been dismal, nothing more than a droning voice reminiscent of the teacher in the old Charlie Brown specials – blah blah blah blah blah.

Well, I need to head back to LA. On the way, I am going to listen to the podcasts and iTunes U to try to pick up some ideas. After all, I can use the iPod on my iPhone on the 8-hour drive home. That’s one more advantage to podcasts over the Internet.

Oh, note to self: the next time you refer to the kids as “this MTV generation”, you’re dating yourself. MTV has been replaced by iTunes, YouTube and Facebook (MySpace is so-o-o 2007).

# Phi coefficients, Christmas and the number 42

January 3, 2009

People like familiarity. That’s probably one reason we enjoy the holidays so much – we know all the words to Silent Night, how to carve a turkey, which of the Christmas cookies taste the best. If I am going to convince you to give up statistics with which you feel comfortable, such as chi-square and correlation coefficients, to dive into logit models, I probably owe you an explanation as to why that might be desirable and even necessary. Kind of like convincing you that turkey will give you food poisoning or that your Christmas ornament was made by shoeless, small children working for pennies a day down in Guatemala.

*What could possibly be wrong with phi coefficients?*

Phi coefficients are interpreted similarly to the Pearson coefficient which all of us learned in our first statistics course, along with having the mantra “Never infer causation from correlation” beaten into our heads.

Correlation coefficients are comfortable.

If you liked *The Hitchhiker’s Guide to the Galaxy* (and how could you possibly not?), you might remember that when Arthur Dent found out that 42, the answer to the ultimate question of Life, the Universe, and Everything, was the answer to the question “What is six times nine?”, he responded,

*“I always thought something was fundamentally wrong with the universe.”*

**Facts about the phi coefficient you maybe never knew or gave much thought**

Here is the problem with the phi coefficient – the statement that phi is interpreted similarly to the Pearson correlation coefficient is based on the assumption that the marginal distributions of the variables are equal. For example, it is assumed for dichotomous variables that 50% of the population falls in each category.

Now we all know that restricted variance has a negative impact on the size of the correlation – the well-known ‘restriction of range’. If you are reading this blog, there is a high probability that you know the formula for the population variance: the sum, over every x, of (x minus the mean of x) squared, divided by n. If every value is close to the mean, you have relatively little variance.

Think about this for a moment – the variance of a binary variable = p*q where p is the probability of a given event, say getting a coherent answer from me before 7 a.m., and q is the probability of the opposite of that event. If you don’t remember some of these formulae or just don’t believe me, here is a nice site from the Visual Statistics Studio to remind you.

So… the maximum possible variance occurs when the split between p and q is 50-50, and the variance = .25. (Believe me, the probability of getting a coherent response from me before 7 a.m. is NOT .50!) When we have a distribution that departs substantially from 50-50 – say, 95% of our patients lived and only 5% died – we have a restriction of range.
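Since this is exactly the kind of thing best believed after checking it yourself, here is a quick sketch in Python (using the made-up patient numbers from above) showing that the general variance formula and p*q agree, and that the variance peaks at a 50-50 split:

```python
# A quick check that the variance of a binary variable really is p*q,
# and that it peaks at .25 for a 50-50 split.

def binary_variance(p):
    """Variance of a 0/1 variable where P(1) = p."""
    return p * (1 - p)

def population_variance(values):
    """Sum of (x - mean)^2 over every x, divided by n."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

# 95 patients lived (0), 5 died (1): a badly restricted range
patients = [1] * 5 + [0] * 95
print(round(population_variance(patients), 4))  # 0.0475
print(round(binary_variance(0.05), 4))          # 0.0475, the same: p*q
print(binary_variance(0.50))                    # 0.25, the maximum
```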

So, problem number one:

The phi coefficient’s maximum value is NOT always 1.0. Here is something you probably always intuitively knew but never gave much thought:

When the split is fairly extreme on one of the variables and not so extreme on the other, the maximum phi coefficient is considerably less than 1.0. In fact, if it is a 90-10 split on one variable, say, the probability of mortality from a given disease, and 50-50 on the other, for example whether a patient was in the treatment or control group, the maximum possible phi value is only .33. How depressing is that? For a really interesting discussion of this whole topic of marginal probabilities and maximum phi coefficients, I recommend the book **Practical Meta-Analysis**, by Lipsey & Wilson. And no, I have never met them and do not receive a cut from any books they sell. However, if they are reading this, they are most welcome to show their gratitude via Starbucks gift cards.
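For the curious, that .33 can be reproduced from the standard formula for the maximum phi given the marginals: with p1 ≤ p2 the proportions of “success” on the two variables, the maximum is sqrt(p1(1-p2) / ((1-p1)p2)). A Python sketch (the function name is mine, not from any package):

```python
import math

def max_phi(p1, p2):
    """Maximum attainable phi for two binary variables whose 'success'
    proportions are p1 and p2. The standard formula assumes p1 <= p2,
    so we sort first."""
    p1, p2 = sorted((p1, p2))
    return math.sqrt((p1 * (1 - p2)) / ((1 - p1) * p2))

# 90-10 split on mortality, 50-50 on treatment vs. control:
print(round(max_phi(0.10, 0.50), 2))  # 0.33
# Equal marginals: the familiar maximum of 1.0
print(max_phi(0.50, 0.50))            # 1.0
```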

It is because of this and other problems with the phi coefficient that other statistics and models are preferable. Enter, for example, the odds ratio. The odds ratio does not range from -1 to +1, even in theory. It ranges from zero to infinity.

Odds ratios are good for studying diseases. Personally, I believe coffee cures all ills and particularly reduces your stress in the morning. Let’s say that in a sample of 100 non-coffee drinkers, 50 of them get heart disease, and of 100 coffee drinkers, 25 of them get heart disease. The odds ratio, then, is 3.00. The odds of getting heart disease versus not for a non-coffee drinker are even, 1:1, while the odds for a coffee drinker are 3:1 against, so the odds ratio is 3.0.
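The coffee arithmetic, written out as a Python sketch (same made-up numbers):

```python
# The coffee example as a 2x2 table. Cells: a, b = disease / no disease
# in group 1; c, d = disease / no disease in group 2.

def odds_ratio(a, b, c, d):
    """Odds ratio via the cross-product, (a*d)/(b*c)."""
    return (a * d) / (b * c)

# 100 non-coffee drinkers: 50 disease, 50 not -> odds 1:1
# 100 coffee drinkers:     25 disease, 75 not -> odds 1:3
print(odds_ratio(50, 50, 25, 75))  # 3.0
```

The cross-product form (a*d)/(b*c) is algebraically the same as dividing the two odds, and it avoids rounding the intermediate odds.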

Incidentally, these data are completely made up as an example and you should not take this post as evidence to either begin drinking coffee or invest in shares of Maxwell House. On the other hand, if you need to be told that, you probably should not be allowed to have your own checkbook, anyway.

The odds ratio itself also has problems. One problem is that while it would be very nice to have a standard error for the odds ratio, its distribution is skewed – remember that whole zero-to-infinity thing?

A standard error, confidence intervals, that whole thing, assumes a random, normal distribution of errors. Enter the natural logarithm of the odds ratio – but that is a post for another day, after three cups of coffee.
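As a tiny preview of why the natural log helps – purely illustrative, not the promised post – on the log scale “no effect” sits at zero, and equal effects in opposite directions are mirror images:

```python
import math

# On the raw scale the odds ratio runs from 0 to infinity with 1.0
# ("no effect") off-center. On the log scale, no effect is 0 and equal
# effects in opposite directions are symmetric around it.
print(round(math.log(3.0), 4))    # 1.0986  (odds tripled)
print(round(math.log(1 / 3), 4))  # -1.0986 (odds cut to a third)
print(math.log(1.0))              # 0.0     (no effect)
```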

## Blogroll

- Andrew Gelman's statistics blog - is far more interesting than the name
- Biological research made interesting
- Interesting economics blog
- Love Stats Blog - How can you not love a market research blog with a name like that?
- Me, twitter - Thoughts on stats
- SAS Blog for the rest of us - Not as funny as some, but twice as smart. If this is for the rest of us, who are those other people?
- Simply Statistics, simply interesting
- Tech News that Doesn’t Suck
- The Endeavor -John D Cook - Another statistics blog