### Jan

#### 15

# Descriptives, Details and Death

Filed Under Software, statistics

I think descriptive statistics are under-rated. One reason I like Leon Gordis’ Epidemiology book is that he agrees with me. He says that sometimes statistics pass the “inter-ocular test”. That is, they hit you right between the eyes.

I’m a big fan of eye-balling statistics and SAS/GRAPH is good for that. Let’s take this example. It is fairly well established that women have a longer life span than men in the United States. In other words, men die at a younger age. Is that true of all causes?

To answer that question, I used a subset of the Framingham Heart Study and looked at two major causes of death, coronary heart disease and cancer. The first thing I did was round the age at death into five-year intervals to smooth out some of the fluctuations from year to year.

data test2 ;

set sashelp.heart ;

ageatdeath5 = round(ageatdeath,5) ;

proc freq data=test2 noprint;

tables sex*ageatdeath5*deathcause / missing out= test3 ;

/* NOTE THAT THE MISSING OPTION IS IMPORTANT */

**THE DEVIL IS IN THE DETAILS**

Then I did a frequency distribution by sex, age at death and cause of death. Notice that I used the missing option. That is super-important. Without it, instead of getting what percentage of the entire population died of a specific cause at a certain age, I would get a percentage of those who died. However, as with many studies of survival, life expectancy, etc. a substantial proportion of the people were still alive at the time data were being collected. So, percentage of the population, and percentage of people who died were quite different numbers. I used the NOPRINT option on the PROC FREQ statement simply because I had no need to print out a long, convoluted frequency table I wasn’t going to use.

I used the OUT = option to output the frequency distribution to a dataset that I could use for graphing.
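To see why the MISSING option changes the percentages, here is a minimal sketch in Python rather than SAS. The eight records are made up for illustration (they are not the Framingham data), and `None` stands in for a person who was still alive when the data were collected.

```python
import math
from collections import Counter

# Hypothetical records: (sex, age at death); None means still alive.
records = [
    ("Male", 62), ("Male", 68), ("Male", None), ("Male", None),
    ("Female", 71), ("Female", None), ("Female", None), ("Female", None),
]

def round_to_5(age):
    """Round to the nearest multiple of 5, in the spirit of SAS round(age, 5)."""
    return None if age is None else int(math.floor(age / 5 + 0.5)) * 5

counts = Counter((sex, round_to_5(age)) for sex, age in records)

# Keeping the missing category: the denominator is everyone in the study.
pct_of_population = {k: 100 * v / len(records) for k, v in counts.items()}

# Dropping the missing category: the denominator is only those who died.
died = {k: v for k, v in counts.items() if k[1] is not None}
pct_of_deaths = {k: 100 * v / sum(died.values()) for k, v in died.items()}

print(pct_of_population[("Male", 60)])  # share of all 8 people
print(pct_of_deaths[("Male", 60)])      # share of the 3 deaths
```

The same count of one death becomes a much larger percentage once the people still alive are left out of the denominator.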

**More details:** The symbol statements just make the graphs easier to read by putting an asterisk at each data point and by joining the points together. I have very bad eyesight so anything I can do to make graphics more readable, I try to do.

symbol1 value = star ;

symbol1 interpol = join ;

Here I am just sorting the data set by cause of death and only keeping those with Cancer or Coronary Heart Disease.

proc sort data=test3;

by deathcause ;

where deathcause in ("Cancer","Coronary Heart Disease");

**Even more details.** You always want to have the axes the same on your charts or you can’t really compare them. That is what the UNIFORM option in the PROC GPLOT statement does. The PLOT statement requests a plot of percent who died at each age by sex. The LABEL statement just gives reasonable labels to my variables.

proc gplot data = test3 uniform;

plot percent*ageatdeath5 = sex ;

by deathcause ;

Label percent = "%"

ageatdeath5 = "Age at Death" ;

run ;

When you look at these graphs, even if your eyes are as bad as mine you can see a few things. The top chart is of cancer and you can conclude a couple of things right away.

- There is not nearly the discrepancy in the death rates of men and women for cancer as there is for heart disease.
- Men are much more likely to die of heart disease than women at every age up until 80 years old. After that, I suspect that the percentage of men dying off has declined relative to women because a very large proportion of the men are already dead.

So, the answer to my question is “No.”

### Dec

#### 31

# My Year in Books: Technical Edition

Filed Under Dr. De Mars General Life Ramblings, Software, statistics

I read a lot. This year, I finished 308 books on my Kindle app, another dozen on iBooks, a half-dozen on ebrary and 15 or 20 around the house. I don’t read books on paper very often any more. It’s not too practical for me. I go through them at the rate of about a book a night, thanks to a very successful speed reading program when I was young (thank you, St. Mary’s Elementary School). Don’t be too impressed. I don’t watch TV and I read a lot of what a colleague called “junk food for the brain”. I read a bunch of Agatha Christie novels, three Skulduggery Pleasant books and several of the Percy Jackson and the Olympians books. Yes, when it comes to my fiction reading, I have the interests of a fourteen-year-old girl. Trying to read like a grown-up, I also read a bunch of New York Times bestseller novels and didn’t like any of them.

So, I decided to do my own “best books list” based on a random sample of one, me, and make up my own categories.

Because I taught a course on multivariate statistics, I read a lot of books in that area, and while several of them were okay, there was only one that I really liked.

**The winner for best statistics book I read this year**

*Applied Logistic Regression, 3rd Edition,* by David Hosmer, Stanley Lemeshow and Rodney Sturdivant.

I really liked this book. I’m not new to logistic regression, but I’m always looking for new ideas, new ways to teach, and this book was chock full of them. What I liked most about it is that they used examples with real data, e.g., when discussing multinomial logistic regression, the dependent variable was type of placement for adolescents, and one of the predictor variables was how likely the youthful offender was to engage in violence against others. It is a very technical book and if you are put off by matrix multiplication and odds ratios, this isn’t the book for you. On the other hand, if you want an in-depth understanding of logistic regression from a practical point of view, read it from beginning to end.

**Best SAS book I read this year …**

Let me start with the caveat that I have been using SAS for over 30 years and I don’t teach undergraduates, so I have not read any basic books at all. I read a lot of books on a range of advanced topics and most of them I found to be just – meh. Maybe it is because I had read all the good books previously and so the only ones I had left unread lying around were just so-so. All that being said, the winner is …

*Applied Statistics and the SAS Programming Language (5th Ed)*, by Ronald Cody and Jeffrey Smith

This book has been around for eight years and I had actually read parts of it a couple of years ago, but this was the first time I read through the whole book. It’s a very readable intermediate book. Very little mathematics is included. It’s all about how to write SAS code to produce a factor analysis, repeated measures ANOVA, etc. It has a lot of random stuff thrown in, like a review of functions, and working with date data. If you have a linear style of learning and teaching, you might hate that. Personally, I liked that about it. Eight years is an eon in programming time, but chi-square and ANOVA have been around for 100 years, so the age of the book wasn’t an issue. While I don’t generally like the use of simulated data for problems in statistics, for teaching this was really helpful because when students were first exposed to a new concept they didn’t need to get a codebook or fix the data first. For the purpose of teaching applied statistics, it’s a good book.

**Best JavaScript programming book I read this year**

I read a lot of JavaScript books and found many of them interesting and useful, so this was a hard choice.

*jQuery Cookbook*, edited by Cody Lindley

was my favorite. If you haven’t gathered by now, I’m fond of learning by example, and this book is pretty much nothing but elaborate examples along the lines of, “Say you wanted to make every other row in a table green”. There are some things like that I can imagine wanting to do and others I cannot think of any need to use ever. However, those are famous last words. When I was in high school, I couldn’t imagine I would ever use the matrix algebra we were learning.

**Best game programming book I read this year**

Again, I read a lot of game programming books. I didn’t read a lot of mediocre game programming books. They all were either pretty good or sucked. The best of the good ones was difficult to choose, but I did:

*Building HTML5 Games*, by Jesse Freeman

This is a very hands-on approach to building 2-D games with Impact, with, you guessed it, plenty of examples. I was excited to learn that he has several other books out. I’m going to read all of them next year.

So, there you have it …. my favorite technical books that I read this year. Feel free to make suggestions for what I ought to read next year.

### Dec

#### 18

*(There may even be a part two, if I get around to it.)*

Let me ask you a couple of questions:

1. Do you have more than just one dependent variable and one independent variable?

2. If you said, yes, do you have a CATEGORICAL or ORDINAL dependent variable? If so, use logistic regression. I have written several posts on it. You can find a list of them here. Some involve Euclid, marriage, SAS and SPSS. Alas, none involve a naked mole rat. I shall have to remedy that.

3. You said yes to #1, multiple variables, but no to #2, so I am assuming you have multiple variables in your design and your dependent variable is interval or continuous, something like sales for the month of December, average annual temperature or IQ. The next question is, do you have only ONE dependent variable and is it measured only ONCE per observation? For example, you have measured average annual temperature of each city in 2013 or sales in December 2012. In this case, you would do either Analysis of Variance or multiple regression. It doesn’t matter much which you do if you code it correctly. Both are specific cases of the general linear model and will give you the same result. You may also want to do a general linear MIXED model, where you have city as a random effect and something else, say, whether the administration was Democratic or Republican, as a fixed effect. In this case I assume that you have sales as your dependent variable because, contrary to the beliefs of some extremists, political parties do not determine the weather. Generally, whether you use a mixed model or an Ordinary Least Squares (OLS) plain vanilla ANOVA or regression will not have a dramatic impact on your results, unless the result is a grade in a course where the professor REALLY wants you to show that you know that school is a random effect when comparing curricula.

4. Still here? I’m guessing you have one of two other common designs. That is, you have measured the same subjects, stores, cities, whatever, more than once. Most commonly, it is the good old pretest posttest design and you have an experimental and control group. You want to know if it works. If you have only tested your people twice, you are perfectly fine with a repeated measures ANOVA. If you have tested them more than twice, you are very likely to have grossly violated the assumption of compound symmetry and I would recommend a mixed model.

5. All righty then, you DO have multiple variables, they are NOT categorical or ordinal, your dependent variable is NOT repeated, so you must have multiple dependent variables. In that case, you would do a multivariate Analysis of Variance.

Some might argue that logistic regression is not a multivariate design. Other people would argue with them that, assuming your data are multinomial, you need multiple logit functions so that really is a type of multivariate design. A third group of people would say it is multivariate in the ordinal or multinomial case because there are multiple possible outcomes.

Personally, I wonder about all of those types of people. I wonder about the amount of time in higher education spent in forcing students to learn answers to questions that have no real use or purpose as far as I can see.

On the other hand, while knowing whether something falls in the multivariate category or not probably won’t impact your life or analyses, if you treat time as an independent variable and analyze your repeated measures ANOVA with experiment and condition as a 2 x 2 ANOVA, you’re screwed.

Know your research designs.

### Dec

#### 2

What if you wanted to turn your PROC MIXED into a repeated measures ANOVA using PROC GLM? Why would you want to do this? Well, I don’t know why you would want to do it but I wanted to do it because I wanted to demonstrate for my class that both give you the same fixed effects F value and significance.

I started out with the Statin dataset from the Cody and Smith textbook. In this data set, each subject has three records, one each for drugs A, B and C. To do a mixed model with subject as a random effect and drug as a fixed effect, you would code it as follows. Remember to include both the subject variable and your fixed effect in the CLASS statement.

Proc mixed data = statin ;

class subj drug ;

model ldl = drug ;

random subj ;

To do a repeated measures ANOVA with PROC GLM you need three variables for each subject, not three records.

**First, create three data sets** for Drug A, Drug B and Drug C.

Data one two three ;

set statin ;

if drug = 'A' then output one ;

else if drug = 'B' then output two ;

else if drug = 'C' then output three ;

**Second, sort these datasets and as you read in each one, rename LDL** to a new name so that when you merge the datasets you have three different names. Yes, I really only needed to rename two of them, but I figured it was just neater this way.

proc sort data = one (rename= (ldl =ldla)) ;

by subj ;

proc sort data= two (rename = (ldl = ldlb)) ;

by subj ;

proc sort data=three (rename =(ldl = ldlc)) ;

by subj ;

**Third, merge the three datasets by subject.**

data mrg ;

merge one two three ;

by subj ;

**Fourth, run your repeated measures ANOVA.**

Your three measurements of LDL are the dependent variables. It seems weird to not have an independent variable on the other side of the equation, but that’s the way it is. In your REPEATED statement you give a name for the repeated variable and the number of levels. I used “drug” here to be consistent but actually, this could be any name at all. I could have used “frog” or “rutabaga” instead and it would have worked just as well.

proc glm data = mrg ;

model ldla ldlb ldlc = /nouni ;

repeated drug 3 (1 2 3) ;

run ;

Now you can be happy.
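If the reshaping in the first three steps is hard to picture, here is a minimal sketch of the same long-to-wide idea in Python rather than SAS. The LDL values are made up for illustration and are not the Statin data; only the variable names `subj`, `drug`, `ldl`, `ldla`, `ldlb` and `ldlc` come from the code above.

```python
# Hypothetical long-format data: one record per subject per drug.
long_rows = [
    {"subj": 1, "drug": "A", "ldl": 110},
    {"subj": 1, "drug": "B", "ldl": 125},
    {"subj": 1, "drug": "C", "ldl": 130},
    {"subj": 2, "drug": "A", "ldl": 100},
    {"subj": 2, "drug": "B", "ldl": 118},
    {"subj": 2, "drug": "C", "ldl": 121},
]

wide = {}
for row in long_rows:
    # One output record per subject, with one LDL column per drug.
    record = wide.setdefault(row["subj"], {"subj": row["subj"]})
    record["ldl" + row["drug"].lower()] = row["ldl"]

# Sorted by subject, like the merged data set.
wide_rows = [wide[subj] for subj in sorted(wide)]

for record in wide_rows:
    print(record)
```

Each subject ends up with a single record holding three variables, which is the layout the repeated measures ANOVA needs.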

### Nov

#### 19

# The most common error new SAS users make

Filed Under Software, statistics, Technology

Any time you learn anything new it can be intimidating. That is true of programming as well as anything else. It may be even more true of using statistical software because you combine the uneasiness many people have about learning statistics with learning a new language.

To a statistician, this error message makes perfect sense:

`ERROR: Variable BP_Status in list does not match type prescribed for this list.`

but to someone new to both statistics and SAS it may be clear as mud.

Here is your problem.

The procedure you are using, PROC UNIVARIATE or PROC MEANS, is designed ONLY for numeric variables. You have tried to use it for a categorical variable.

This error means you’ve used a categorical variable in a list where only numeric variables are expected. For example, bp_status is “High”, “Normal” or “Optimal”.

You cannot find the mean or standard deviation of words, so your procedure has an error.

So … what do you do if you need descriptive statistics?

Go back to your PROC UNIVARIATE or PROC MEANS and delete the offending variables. Re-run it with only numeric variables.

For your categorical variables, use PROC FREQ for a frequency distribution and/or PROC GCHART.
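The same split between numeric and categorical variables shows up in any language. Here is a minimal sketch in Python rather than SAS, with four made-up records. The name `systolic` and its values are hypothetical; `bp_status` mimics the variable in the error message.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical records for illustration only.
rows = [
    {"systolic": 118, "bp_status": "Optimal"},
    {"systolic": 132, "bp_status": "Normal"},
    {"systolic": 160, "bp_status": "High"},
    {"systolic": 150, "bp_status": "High"},
]

# Numeric variable: a mean and standard deviation make sense.
systolic = [r["systolic"] for r in rows]
print(mean(systolic), round(stdev(systolic), 2))

# Categorical variable: count the categories instead.
status_counts = Counter(r["bp_status"] for r in rows)
print(status_counts)

# Asking for the mean of words raises an error here too.
try:
    mean(r["bp_status"] for r in rows)
    mean_of_words_failed = False
except TypeError:
    mean_of_words_failed = True
print(mean_of_words_failed)
```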

Problem solved.

You’re welcome.

### Nov

#### 13

# Doing your statistics homework with SAS – confidence intervals

Filed Under Software, statistics

Computing confidence intervals is one of the areas where beginning statistics students have the most trouble. It is not as difficult if you break it down into steps, and if you use SAS or other statistical software.

Here are the steps:

1. Compute the statistic of interest, that is, the mean, proportion or difference between means.

2. Compute the standard error of the statistic.

3. Obtain the critical value. Do you have 30 or more in your sample and are you interested in the 95% confidence interval?

- If yes, the critical value is 1.96.
- If no (fewer than 30 people), look up the t-value for your degrees of freedom (sample size minus one) at 95% confidence.
- If no (different alpha level), look up the z-value for your alpha level.
- If no (different alpha level AND fewer than 30), look up the t-value for your degrees of freedom and your alpha level.

4. Multiply the critical value you obtained in step #3 by the standard error you obtained in #2

5. Subtract the result you obtained in step #4 from the statistic you obtained in #1 . That is your lower confidence limit.

6. Add the result you obtained in step #4 to the statistic you obtained in #1. That is your upper confidence limit.

**Simplifying it with SAS**

Here is a homework problem:

The following data are collected as part of a study of coffee consumption among undergraduate students. The following reflect cups per day consumed:

3 4 6 8 2 1 0 2

A. Compute the sample mean.

B. Compute the sample standard deviation.

C. Construct a 95% confidence interval

I did this in SAS as so

data coffee ;

input cups ;

datalines ;

3

4

6

8

2

1

0

2

;

proc means mean std stderr;

var cups ;

I get the following results.

Analysis Variable: cups

| Mean | Std Dev | Std Error |
|---|---|---|
| 3.2500000 | 2.6592158 | 0.9401748 |

These results give me A and B. Now, all I need to do to compute C is find the correct critical value. With a sample of 8, I look up the t-value for 7 degrees of freedom at 95% confidence and find that it is 2.365.

3.25 – 2.365 * .94 = 1.03

3.25 + 2.365 * .94 = 5.47

That is my confidence interval (1.03, 5.47)
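For anyone who wants to follow the six steps by machine, here is a minimal sketch in Python rather than SAS, using the same eight data points. The critical value 2.365 is typed in from the t table, as above, rather than computed.

```python
import math
from statistics import mean, stdev

cups = [3, 4, 6, 8, 2, 1, 0, 2]

n = len(cups)
sample_mean = mean(cups)                  # step 1: the statistic
std_error = stdev(cups) / math.sqrt(n)    # step 2: its standard error

# Step 3: tabled t-value for n - 1 = 7 degrees of freedom, 95% confidence.
t_critical = 2.365

margin = t_critical * std_error           # step 4
lower = sample_mean - margin              # step 5: lower confidence limit
upper = sample_mean + margin              # step 6: upper confidence limit

print(round(lower, 2), round(upper, 2))   # 1.03 5.47
```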

=========================

If you want to verify it, or just don’t want to do any computations at all, you can do this

Proc means clm mean stddev ;

var cups ;

You will end up with the same confidence intervals.

Prediction: At least one person who reads this won’t believe me, will run the analysis and be surprised when I am right.

### Nov

#### 10

# Probability and Mixed Martial Arts Decisions

Filed Under statistics

A recent tweet about mixed martial arts decisions set me to thinking about probability. @Fight_ghost tweeted that a TV commentator made no sense when she said that she thought a fighter should have won by split, not unanimous decision. Others on Twitter agreed with him that it was a stupid comment, and asked whether she thought judges should say the other fighter only 2/3 won, or what.

I thought it did make sense in statistical terms. Think of it this way:

The “true score” of the population in this case is the mean of what an infinite number of judges would rate a fighter’s performance. Of course, there is going to be variation around that mean. Some judges may tend to weight take downs a tiny bit more. Judges vary in their definition of a significant strike. Some judges are just going to be clueless or inattentive and give a score that is far from accurate. On the average, though, these balance out and the mean of all of those infinite judges’ scores should be the true score. Let’s say our fighter, Bob, had a true score of 27. The most common score we should see a judge give him is 27, but a 26 or 28 would not be totally unexpected. Given that the standard deviation of fight scores is low, we would be surprised to see him given a score of 25 or 29 and completely floored if he received a 24 or a 30.

Let’s say we have a second fighter, Fred. His true score is 29. The most common score we should see for him is a 29, but again, a 28 or a 30 would not be unexpected because there is variation in our sample of judges.

Here is the point … when fighters are far apart in the true score of their performance, judges should very seldom have a difference of opinion in who won. Even when Bob is scored high, for him, at 28 and Fred is scored his average of 29, Fred still wins. Let’s say the standard deviation of judges’ scores is 1. I believe it is really lower than that and I do know that the winner of a round has to get 10 points, but for ease of computation, just go with me.

For Bob to win, he must be rated at least two standard deviations above his true score (which occurs 2.5% of the time) and Fred must be rated below his true score, which occurs half the time. Since the scores for Bob and Fred are independent, the probability of BOTH of these events happening is .025 x .5 = .0125.

The other way for Bob to win is if Fred scores two standard deviations below his true score, which will occur 2.5% of the time AND for Bob to score above his true score. Again, the combined probability is .0125. SO …. only 2.5% of the time (.0125 + .0125) would Bob win. Since judges’ scores are independent, the probability of any one scoring it for him, causing a split decision is .025 + .025 + .025 = 7.5%

(If all **three** judges scored it for Bob, that would be a very, very low probability of .025 * .025 * .025 because, again, the judges’ scores are assumed independent of one another. In only about 0.0016% of the cases would this occur. We should probably subtract that and the probability of two of them scoring it for Bob to be exact, but I have to finish grading papers tonight so we’ll just acknowledge that it is not exactly 7.5% and move on.)

Let’s go back to the fight that actually happened. I didn’t see it so I am going to take some people’s word that it was a close fight. They might be lying but let’s assume not.

In this case, Bob, who has a true score of 27, is not fighting Fred, but rather, Ignatz, who has a true score of 27.3 (with three judges, he’d get a 27, 27, 28 score). There is great overlap in Bob and Ignatz’s scores. To outscore Ignatz’s average score, Bob would need a score of 27.4. Well, a z-score of .4 or higher occurs about 35% of the time. Half of the time Ignatz is going to score 27.3 or lower, so the probability of him having an average or below score AND Bob having a 27.4 or higher score is .5 * .35 or .175. So 17.5% of the time, a judge would give Bob a higher score. Since there are three judges, the probability of ONE of them giving him a higher score would be roughly .175 + .175 + .175 = 52.5%.
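Adding the three judges’ probabilities is a shortcut that overstates the answer a little. Here is a minimal sketch of the exact version in Python, taking the per-judge probabilities of .025 and .175 from this post as given and assuming, as above, that the judges score independently. The function name is made up for illustration.

```python
from statistics import NormalDist

def at_least_one_of_three(p):
    """Chance that at least one of three independent judges scores it for Bob."""
    return 1 - (1 - p) ** 3

# Per-judge probabilities taken from the post as given.
p_far = 0.025    # Bob against Fred
p_close = 0.175  # Bob against Ignatz

exact_far = at_least_one_of_three(p_far)
exact_close = at_least_one_of_three(p_close)

# Upper-tail area beyond z = 0.4, the "about 35%" used above.
tail_04 = 1 - NormalDist().cdf(0.4)

print(round(exact_far, 4), round(exact_close, 4), round(tail_04, 4))
```

The exact figures come out to about 7.3% and 43.8% rather than 7.5% and 52.5%. The comparison that matters is unchanged: a dissenting judge is several times more likely when the true scores are close.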

There is also the small probability that it could go unanimous the other way, but that’s not really pertinent to our argument.

The point is simply this … if two fighters’ true scores are close, it is much less likely that you will see a unanimous decision than if their true scores are really far apart. The closer they are, the more that statement holds. So, no, it is not a stupid comment to say that you believe someone warranted a split decision rather than a unanimous decision. It may simply mean that you think the fighters were so close that you were surprised there was not any variance in favor of the only slightly better fighter.

Really, I think most people would find that a reasonable statement.

**Extra credit points:**

Give one reason why the Central Limit Theorem does not apply in the above scenario.

Answer this question:

Does the fact that the distribution of errors is necessarily non-symmetric in Fred’s case (cannot score above 30) negate the application of the Central Limit Theorem?

### Nov

#### 1

# How much statistics can you learn in an online course in a month?

Filed Under Dr. De Mars General Life Ramblings, statistics

Last year, I went from teaching in classrooms in a pretty building with a library on the ground floor to teaching on-line. I also went from the semester system to teaching the same content in four weeks. This has led curious friends of mine, used to teaching in the traditional format, to ask,

How does that work? Does it work?

Initially, I was skeptical myself. I thought if students were really serious they would make the sacrifices to take the class in a “regular” setting. Interestingly, I had to take a class on a new system and had the option to sign up for a session held on a local campus or on-line. After looking at my schedule, I chose the on-line option. No one has *ever* accused me of being a slacker – in fact, it may be the only negative thing I’ve not been called. Still, I thought it was possible I might have conflicts those days, whether meeting with clients, employees or investors. The option of taking the course in smaller bits – an hour here or there – was a lot more convenient for me than several hours at a time. To be truthful, too, I didn’t really want to spend hours hanging out with people with whom I didn’t expect I would have that much in common. It wasn’t like a class on statistics that I was really interested in.

So … if we are willing to accept that students who sign up for on-line, limited-term classes might be just as motivated and hard-working as anyone else, do they work? I think the better question is how they work or for what type of students they work.

National University, where I teach, offers courses in a one-course-one-month format. Students are not supposed to take more than one course at a time and, although exceptions can be made, I advise against it. The courses work for those students (and faculty) who can block off a month, and then, during that month DEVOTE A LOT OF TIME TO THE COURSE. Personally, I give two-hour lectures twice a week. If a student cannot attend – and some are in time zones where it is 2 a.m. when I’m teaching – the lectures are recorded and they can listen to them at their leisure. Time so far – 16 hours in the month. Normally, a graduate course I teach will require 50-100 pages of reading per week. Depending on your reading speed that could take you from one to four hours.

I just asked our Project Manager, Jessica, how long she thought it took the average person to read 75 pages of technical material. She said,

“Whatever it is, I’m sure it’s a lot more than you are thinking!”

Talking it over, we agreed it probably took around 3-5 minutes per page, because even if some pages you get right away, others you have to read two or three times to figure out wait, that -1 next to a capital letter in bold means to take the inverse of a matrix while the single quote next to it means to transpose the matrix. These are things that are not second nature to you when you are just learning a field. Discussing this made me think I want to reduce the required reading in my multivariate statistics course. Let’s say on the low end, then it takes five hours to read the assigned material and review it for a test or just for your own clarity. Now we are up to 20 hours a month + 16 = 36 hours.

I give homework assignments because I am a big believer in distributed practice. We have all had classes we crammed for in college that we can’t remember a damn thing about. Okay, well, I have, anyway. So, I give homework assignments every week, usually several problems like, “What is the cumulative incidence rate given the data in Table 2?” as well as assignments that require you to write a program, run it and interpret the results. I estimate these take students 4-5 hours per week. Let’s go on the low end and say 16 + 20 + 16 = 52 hours.

There is also a final paper, a final exam and two quizzes. The final and quizzes are given 5 hours total and it is timed so students can’t go over. I think, based on simply page length, programs required and how often they call me, the average student spends 14 hours on the paper. Total hours for the course 52 hours plus another 19 = 71 hours in four weeks.
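The running total is easy to lose track of, so here it is as a short tally in Python. Every figure is one of the low-end estimates given above; the variable names are just labels for this sketch.

```python
WEEKS = 4

lecture_hours = 2 * 2 * WEEKS   # two 2-hour lectures per week
reading_hours = 5 * WEEKS       # low-end estimate, 5 hours per week
homework_hours = 4 * WEEKS      # low-end estimate, 4 hours per week
exam_hours = 5                  # final exam plus two quizzes, timed
paper_hours = 14                # estimated average for the final paper

total_hours = (lecture_hours + reading_hours + homework_hours
               + exam_hours + paper_hours)

print(total_hours, total_hours / WEEKS)  # 71 17.75
```

That works out to a little under 18 hours a week for each of the four weeks.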

IF students put in that amount of time, they definitely pass the course with a respectable grade and probably learn enough that they will retain a useful amount of it. The kiss of death in a course like this is to put off the work. It is impossible to finish in a week.

My personal bias is that I require students to actually DO things with the information they learn. It is not just memorizing formulas and a lot of calculations, because I really do think students will forget that after a few weeks. However, if they have to post a question that is a serious personal interest and then conduct a study to answer that question, the whole time posting progress and discussions online with their classmates, then I think they WILL retain more of the material.

So, yes, students can learn online and they can learn in a compressed term. It IS harder, though, I think, both for the students and the instructor, and takes a lot of commitment on the part of both, which is why I don’t teach very many courses a year.

### Oct

#### 23

# Parallel Analysis Criterion Simplified?

Filed Under Software, statistics

Am I missing something here? All of the macros I have seen for the parallel analysis criterion for factor analysis look pretty complicated, but, unless I am missing something, it is a simple deal.

The presumption is this:

There isn’t a number like a t-value or F-value to use to test if an eigenvalue is significant. However, it makes sense that the eigenvalue should be larger than if you factor analyzed a set of random data.

Random data is, well, random, so it’s possible you might have gotten a really large or really small eigenvalue the one time you analyzed the random data. So, what you want to do is analyze a set of random data with the same number of variables and the same number of observations a whole bunch of times.

Horn, back in 1965, was proposing that the eigenvalue should be higher than the average of when you analyzed a set of random data. Now, people are suggesting it should be higher than 95% of the time you analyzed random data (which kind of makes sense to me).

Either way, it seems simple. Here is what I did and it seems right so I am not clear why other macros I see are much more complicated. Please chime in if you see what I’m missing.

- Randomly generate a set of random data with N variables and Y observations.
- Factor analyze it and keep the eigenvalues.
- Repeat 500 times.
- Combine the 500 datasets (each will only have 1 record with N variables).
- Find the 95th percentile of each eigenvalue.

%macro para(numvars,numreps) ;

%DO k = 1 %TO 500 ;

data A;

array nums {&numvars} a1- a&numvars ;

do i = 1 to &numreps;

do j = 1 to &numvars ;

nums{j} = rand("Normal") ;

if j < 2 then nums{j} = round(100*nums{j}) ;

else nums{j} = round(nums{j}) ;

end ;

drop i j ;

output;

end;

proc factor data= a outstat = a&k noprint;

var a1 - a&numvars ;

data a&k ;

set a&k ;

if trim(_type_) = "EIGENVAL" ;

%END ;

%mend ;

%para(30,1000) ;

data all ;

set a1-a500 ;

proc univariate data= all noprint ;

var a1 - a30 ;

output out = eigvals pctlpts = 95 pctlpre = pa1 - pa30;

*** You do not need the transpose but I just find it easier to read ;

proc transpose data= eigvals out=eigsig ;

Title "95th Percentile of Eigenvalues " ;

proc print data = eigsig ;

run ;

It runs fine and I have puzzled and puzzled over why a more complicated program would be necessary. I ran it 500 times with 1,000 observations and 30 variables and it took less than a minute on a remote desktop with 4GB RAM. Yes, I do see the possibility that if you had a much larger data set that you would want to optimize the speed in some way. Other than that, though, I can’t see why it needs to be any more complicated than this.

If you wanted to change the percentile, say, to 50, you would just change the 95 above. If you wanted to change the method from, say, Principal Components Analysis (the default, with communality of 1) to something else, you could just do that in the PROC FACTOR step above.

The above assumes a normal distribution of your variables, but if that was not the case, you could change that in the RAND function above.
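For readers without SAS, the same five steps can be sketched in plain Python. This is an illustration under stated assumptions, not a translation of the macro: it gets the eigenvalues of the correlation matrix with a hand-written Jacobi rotation routine instead of PROC FACTOR, it does not round the random values, and it uses 5 variables, 100 observations and 40 replications so it runs in a moment. All function names are made up for this sketch.

```python
import math
import random
import statistics

def correlation_matrix(rows):
    """Pearson correlation matrix for a list of equal-length observations."""
    centered = []
    for col in zip(*rows):
        m = statistics.fmean(col)
        centered.append([x - m for x in col])
    norms = [math.sqrt(sum(x * x for x in col)) for col in centered]
    p = len(centered)
    return [[sum(a * b for a, b in zip(centered[i], centered[j]))
             / (norms[i] * norms[j]) for j in range(p)] for i in range(p)]

def eigenvalues(matrix, sweeps=50):
    """Eigenvalues of a symmetric matrix by Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < 1e-18:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                theta = (a[q][q] - a[p][p]) / (2 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1))
                c = 1 / math.sqrt(t * t + 1)
                s = t * c
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def parallel_analysis(numvars, numobs, reps, seed=0):
    """95th percentile of each eigenvalue across replications of random normal data."""
    rng = random.Random(seed)
    results = []
    for _ in range(reps):
        data = [[rng.gauss(0, 1) for _ in range(numvars)] for _ in range(numobs)]
        results.append(eigenvalues(correlation_matrix(data)))
    # quantiles(..., n=20) returns 19 cut points; the last is the 95th percentile.
    return [statistics.quantiles([r[i] for r in results], n=20)[-1]
            for i in range(numvars)]

p95 = parallel_analysis(numvars=5, numobs=100, reps=40)
print([round(v, 3) for v in p95])
```

An eigenvalue from your real data would then be kept only if it is larger than the value in the same position of `p95`.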

As I said, I am puzzled. Suggestions to my puzzlement welcome.

### Oct

#### 22

# Reports of The Death of Theory Have Been Greatly Exaggerated

Filed Under statistics

I’m about to tear my hair out. I’ve been reading this statistics textbook which shall remain nameless, ostensibly a book for graduate students in computer science, engineering and similar subjects. The presumption is that at the end of the course students will be able to compute and interpret a factor analysis, MANOVA and other multivariate statistics. The text spends 90% of the space discussing the mathematics in computing the results, 10% discussing the code to get these statistics and 0% discussing decisions one makes in selection of communality estimates, rotation, post hoc tests or anything else.

In short, the book is devoted almost entirely to explaining the part that the computer does for you, which students will never need to do by hand, and gives little or no space to the decisions and interpretation that they will spend their careers doing. One might argue that it is good to understand what is going on “under the hood”, and I’m certainly not going to argue against that. But there is a limit on how much can be taught in any one course, and I would argue very strenuously that there needs to be a much greater emphasis on the part the computer cannot do for you.

There was an interesting article in Wired a few years ago on The End of Theory, saying that we now have immediate access to so much data that we can use “brute force”. We can throw the data into computers and “find patterns where science cannot.”

Um. Maybe not.

Let’s take an example I was working on today, from the California Health Interview Survey. There are 47,000+ subjects but it wouldn’t matter if there were 47 million. There are also over 500 variables measured on these 47,000 people. That’s over 23,000,000 pieces of data. Not huge by some standards, but not exactly chicken feed, either.

Let’s say that I want to do a factor analysis, which I do. By some theory – or whatever that word is we’re using instead of theory – I could just dump all of the variables into an analysis and magically factors would come out, if I did it often enough. So, I did that and came up with results that meant absolutely nothing because the whole premise was so stupid.

Here are a couple of problems:

1. The CHIS includes a lot of different types of variables: sample weights, coding for race and ethnic categories, dozens of items on treatment of asthma, diabetes or heart disease, dozens more items on access to health care. Theoretically (or computationally, I guess the new word is), one could run an analysis and we would get factors of asthma treatment, health care access, etc. Well, except I don’t really see that the variables that are not on a numeric scale are going to be anything but noise. What the heck would racesex, coded as 1 = “Latin male”, 10 = “African American male” etc., ever load on as a factor?

2. LOTS of the variables are coded with -1 as inapplicable. For example, “Have you had an asthma attack in the last 12 months?”

-1 = Inapplicable

1 = Yes

2 = No

While this may not be theory, these two problems do suggest that some knowledge of your data is essential.
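For the second problem, one way to keep the -1 codes from being treated as real values is to set them to missing before the analysis. This is only a minimal sketch: the three-row data set is made up, and asthma12 and diab12 are hypothetical names standing in for the real CHIS items.

```sas
/* Made-up example rows; asthma12 and diab12 are hypothetical item names */
data chis_demo ;
   input id asthma12 diab12 ;
   datalines ;
1 -1 2
2 1 -1
3 2 2
;
run ;

data chis_clean ;
   set chis_demo ;
   array items{*} asthma12 diab12 ;
   do i = 1 to dim(items) ;
      if items{i} = -1 then items{i} = . ;   /* inapplicable becomes missing */
   end ;
   drop i ;
run ;

/* Check the recode; MISSING shows the missing values as their own row */
proc freq data=chis_clean ;
   tables asthma12 diab12 / missing ;
run ;
```

Keep in mind that PROC FACTOR leaves out any observation with a missing value on any analysis variable, so recoding a lot of skip-pattern items this way can shrink your sample in a hurry. That is one more reason to pick your variables deliberately.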

Once you get results, how do you interpret them? Using the default minimum eigenvalue of 1 criterion (and if all you learned in school was how to factor analyze a matrix using a pencil and a pad of paper, I guess you’d use the defaults), you get 89 factors. Here is my scree plot.

I also got another 400+ pages of output that I won’t inflict on you.

What exactly is one supposed to do with 500 variables that load on 89 factors? Should we then factor analyze these factors to further reduce the dimensions? It would certainly be possible. All you’d need to do is output the factor scores on the 89 factors, and then do a factor analysis on that.
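In case it helps to see what that two-step approach would look like, here is a small sketch on made-up data: 12 simulated items and 4 factors rather than 500 variables and 89 factors. The data set names, variable names and the PROMAX rotation are my assumptions for illustration, not something taken from the CHIS analysis.

```sas
/* Made-up stand-in data: 500 observations on 12 correlated items */
data demo ;
   call streaminit(2011) ;                /* arbitrary seed */
   array item{12} item1-item12 ;
   do id = 1 to 500 ;
      g = rand('NORMAL') ;                /* shared component so the items correlate */
      do j = 1 to 12 ;
         item{j} = 0.5*g + rand('NORMAL') ;
      end ;
      output ;
   end ;
   drop g j ;
run ;

/* First pass: OUT= needs raw data and an explicit NFACTORS= ;
   the scores are written as Factor1, Factor2, ... */
proc factor data=demo nfactors=4 rotate=promax out=scores noprint ;
   var item1-item12 ;
run ;

/* Second pass: factor analyze the factor scores */
proc factor data=scores ;
   var factor1-factor4 ;
run ;
```

I used an oblique rotation here because with the default principal components extraction and no rotation the scores are uncorrelated, which would leave the second pass with nothing to find.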

I would argue, though, and I would be right, that before you do any of that you need to actually put some thought into the selection of your variables and how they are coded.

Also, you should perhaps understand some of the implications of having variables measured on vastly different scales. As this handy page on item analysis points out,

“Bernstein (1988) states that the following simple examination should be mandatory: “When you have identified the salient items (variables) defining factors, compute the means and standard deviations of the items on each factor. If you find large differences in means, e.g., if you find one factor includes mostly items with high response levels, another with intermediate response levels, and a third with low response levels, there is strong reason to attribute the factors to statistical rather than to substantive bases” (p. 398).”
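The check Bernstein describes takes very little code. Here is a sketch with made-up responses, where f1_a and f1_b are hypothetical items salient on one factor and f2_a and f2_b are hypothetical items salient on another.

```sas
/* Made-up responses on four hypothetical items */
data items ;
   input f1_a f1_b f2_a f2_b ;
   datalines ;
5 4 1 2
4 5 2 1
5 5 1 1
4 4 2 2
;
run ;

title 'Items salient on factor 1' ;
proc means data=items mean std maxdec=2 ;
   var f1_a f1_b ;
run ;

title 'Items salient on factor 2' ;
proc means data=items mean std maxdec=2 ;
   var f2_a f2_b ;
run ;

title ;   /* clear the title */
```

If the means line up by factor the way they do in this toy example, with one factor all high and the other all low, that is the pattern Bernstein warns about.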

And hold that thought, because our analysis of the 517 or so variables provided a great example …. or would it be using some kind of theory to point that out? Stay tuned.