The third of my four daughters was being questioned about her training before the last Olympics, and answered:

My mother was the first American to win the world championships, so I called her for advice, and believe me, Mom is always brimming with advice, whether you want it or not …



In fact, all parents know that their own children occasionally take advice from strangers far better than from them. So, for your daughters or mine, here are three pieces of advice on succeeding in the tech world.

1. Learn Calculus – Ignore everyone who tells you that you won’t need it or that it’s too hard. Take it in high school and take it again in college. People often say, “I just can’t do math.” That’s bullshit. You just can’t make the NBA. You can certainly do math. My youngest daughter whines that way sometimes, and yet she won’t sit down and read her Algebra book unless we stand over her and make her do it. Here is why you need to learn calculus:

2. Learn to say “Fuck you” and say it both openly (rarely) and to yourself (often).

My friend, a physician, has a reputation for a great bedside manner. He uses a code phrase. When a patient says something like:

“I have decided to treat my cancer with grapefruit juice instead of chemotherapy.”

He responds,

“I understand how you can see it that way.”

This is his code for,

“You’re a fucking moron.”

You need a code phrase because people will try to dissuade you, denigrate you and generally provide useless advice (contrary to the wonderful advice I am giving you now). They will tell you that you cannot be an entrepreneur because you want to have a family. They’ll tell you that you are not a real ‘techie’ because you don’t have a degree in engineering. If you do have a degree in engineering, the objection will be that you don’t have a degree in Computer Science. If you have both degrees, plus experience as an engineer and a programmer, it will be that you don’t know a specific programming language. Some people seem to have a sadistic desire to pull other people down, saying things like,

“You may have a master’s degree, but it’s not computer science from MIT. You don’t program in Ruby or Java, and everyone knows that unless you have years of experience in both of those you are not really marketable.”

Feel free to tell those people either:

“I understand how you can see it that way, but I’m going to go ahead and apply for the position at the accelerator anyway.”

or:

“Fuck you! I’m going to do it anyway and I don’t care what you think.”

Seriously, there are very few insurmountable obstacles. One of my daughters received a Fulbright scholarship to study in Germany for a few weeks. She almost didn’t apply because she had a young child. I told her that she was being ridiculous: she had a husband, a mother, a mother-in-law and two adult sisters. Between the lot of us, we could take care of one baby.

So what if I let her teethe with Twizzlers during the week I was there?

I also took her swimming in the hotel pool every day, to the science museum, to the aquarium and taught her to dance in elevators. And when Maria came back from Germany, her daughter was still alive, better than ever, because, hey, she had a couple more teeth.

This is really the most important piece of advice I have. Don’t let anyone discourage you and that includes yourself.

3. Learn a programming language or two.

If you followed my first two pieces of advice, this third one will be easier. The whole trick to learning a language is to not get discouraged and plug away at it. Read a book. Write some code. Read another book. Look at programs other people wrote. Think of some things you want to do with that language. Try them. Fail. Swear. Try again. Don’t get discouraged.

Douglas Kranch gave a good description of how expertise develops:

“Expertise develops in three stages. In the first stage, novices focus on the superficial and knowledge is poorly organized. During the end of the second stage, students mimic the instructor’s mastery of the domain. In the final stage, true experts make the domain their own by reworking their knowledge to meet the personal demands that the domain makes of them.”

This is why those first two bits of advice matter. In learning programming, it is easy to get bored or discouraged as you go through those first two stages. It’s easy to start believing that it’s too hard, that the guy who told you women don’t have the same natural talent for programming was right, that it’s too late for you to start now because you didn’t take enough math in college …

“I understand how you can see it that way.”







(Yes, that title does sound like a lot of the spam comments I get. )

Last year, at the Gov 2.0 conference in Santa Monica, Jeanne Holm from NASA spoke about some of the opportunities for open data. I left with mixed feelings. On the one hand, the best examples she gave were, I thought, of “semi-open” data – a term I just made up for greater openness of data within an organization. In one example, there would be a database of the capabilities of researchers within NASA, so that if I were a NASA electrical engineer with an idea for a better electrical system for a lunar module, I could find out who had related expertise in hardware design, reliability testing and so on. That is a great idea, but it also makes me wonder to what extent open data within an organization would actually be put to use. It depends a lot on the organization. Many large institutions – whether corporate, government, non-profit or educational – are not very excited about people going around the usual chain of command or departmental structure, no matter how many times they chant, “Think outside the box!”

Many times, both on this blog and elsewhere, I’ve questioned the probability of useful discoveries coming from open data unless the individuals doing the analysis have some knowledge of statistics, programming and the structure of the data actually being used. If ten thousand people each run 100 analyses, that is 1,000,000 analyses. At the .05 level, we would expect about 50,000 of those to come up statistically significant by chance alone. So if 100,000 results turn out significant, how do we decide which half is important and which half is the 50,000 we would expect by chance?

Ten thousand people doing 100 analyses = 1,000,000 results. One of those will have a p-value of .000001 just by chance. A one-in-a-million coincidence happens one time out of a million, right? So, let’s say we get three of those p < .000001 events. How do we know which ones matter?
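That arithmetic is easy to check with a quick simulation. Here is a minimal sketch in Python (mine, not anything from the studies discussed here): generate a million p-values under the null hypothesis and count how many fall below .05.

```python
import random

random.seed(1)

# Under the null hypothesis, p-values are uniform on [0, 1],
# so each analysis has a 5% chance of being "significant" at .05.
n_analyses = 1_000_000
p_values = [random.random() for _ in range(n_analyses)]

false_positives = sum(p < 0.05 for p in p_values)
print(false_positives)
```

Run it a few times with different seeds and the count hovers around 50,000 – exactly the pile of chance “findings” someone would have to sift through.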


She said, sure, you can have people correlating everything and coming up with nonsensical relationships, like one between the number of flashlights sold and solar flare activity. Presumably, somewhere “out there” are scientists, or consumers of data, or someone, who will be able to separate the real findings from the flashlight-sales-and-solar-flares relationships. Having read a lot of academic journal articles that appear to have been both written and edited by people who were either not paying attention, inebriated or both, I am not so optimistic.

So, I decided to do an experiment and see just how far I could get with some samples of open data. The first data set I chose was from the Kaiser-Permanente study of the oldest old. This is actually two data sets.

One of the reasons I chose these data is that they come with pretty comprehensive documentation. For example, after reading through several hundred pages, I knew that the first data set was the master file and the second was a hospitalization file. In my experiment to see if I could find anything useful in here at all (other than what had already been published), I decided to use just the master file.

A second reason for selecting the oldest old study is that there are some published statistics I can verify my results against to see whether I have the data read in and coded correctly from the beginning. For example, the number of deaths I had in the first cohort (1,565) and second cohort (1,751) matched their figures exactly.

I did not start out with any preconceived notions beyond those of the general public – assumptions like: the older people were at the beginning of the study, the more likely they were to die before it was over. While I’ve worked on some health statistics studies in the past, I am not a physician. This is one reason I used the master file instead of the hospitalization one. I know what an acute myocardial infarction is, but I could not really generate much in the way of hypotheses about it, nor about the accuracy of diagnosis. On the other hand, dead or not dead is pretty objectively measured.

I started with cigar smoking because I have a friend who turned 65 last month. His annual physical went like this:

Doctor: Do you still smoke cigars?

Jim: Yep.

Doctor: Do you still drink 8 or 9 beers every night?

Jim: Yep.

Doctor: Are you going to quit?

Jim: Nope.

Doctor: Well, see you next year.

You know how old people always tell you that it doesn’t matter if they quit now, because they’re too old to die young anyway? I got to wondering if that was true, if people who were smoking and still alive at age 65 would be any more likely to die. (We’ll save the 8 or 9 beers thing for later.)

My first analysis was to look at the data by gender, because I suspected that men were more likely to smoke cigars and I knew that women have a longer life expectancy. When I did a table analysis by gender and cigar smoking using SAS Enterprise Guide, I found that only one of the 288 cigar smokers in the sample was female. So, it was clear to me that if I was looking at the impact of cigar smoking on mortality, I should probably only look at men.

The average age at death for men who smoked cigars (N=175) was 84.2 years, compared to 84.2 years for men who did not smoke cigars (N=1,147). Since there was some rounding here, p is not exactly 1.0, but it is greater than .90. So, it does not seem that quitting cigar smoking would improve your life expectancy, given that you have already made it to age 65.
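For anyone who wants to see the machinery, the comparison of two group means can be sketched by hand. This is a generic Welch t statistic in Python on made-up numbers, not the Kaiser data or the SAS output:

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    # Sample variances (n - 1 in the denominator)
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / na + var_b / nb)

# Made-up ages at death for two small groups, both with a mean of 84.2
smokers = [83.1, 84.9, 85.0, 83.8, 84.2]
nonsmokers = [84.0, 84.5, 83.9, 84.3, 84.3]

t = welch_t(smokers, nonsmokers)
# When the means are this close, t is essentially zero and p is near 1
```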

However, that analysis includes only the people who died. Perhaps not smoking increases your probability of survival? Let’s do a simple chi-square analysis comparing men who smoked cigars to those who did not, and see what the probability is that they died over the nine years they were followed.

Of the cigar smokers, 13.2% died during the nine years participants were followed, compared to 13.1% of men who did not smoke cigars (chi-square = .07, p > .78).
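The chi-square for a two-by-two table like that is simple enough to compute by hand. A sketch in Python, with illustrative cell counts chosen to give death rates of roughly 13.2% versus 13.1% – these are not the actual Kaiser cell counts:

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = 0.0
    # Sum (observed - expected)^2 / expected over the four cells
    for observed, r, col in ((a, row1, col1), (b, row1, col2),
                             (c, row2, col1), (d, row2, col2)):
        expected = r * col / n
        chi2 += (observed - expected) ** 2 / expected
    return chi2

# Rows: smokers, non-smokers. Columns: died, survived. (Made-up counts.)
chi2 = chi_square_2x2(38, 249, 460, 3050)
# Death rates of 13.2% and 13.1% give a chi-square near zero
```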

Still, I was not convinced. Maybe cigar smokers were older, or less likely to be overweight or to drink (well, I doubt that last one). Anyway, being the good little statistician, I ran a logistic regression with death as the dependent variable and BMI, education, age at the beginning of the study (baseline), whether or not the subject had ever drunk alcohol, and whether or not he smoked cigars as the predictors.

[Table of Type 3 effects]

As you can see, the only two significant factors were age (older people were more likely to die in the next nine years – not surprising, since the study began with people 65 and over) and education. I’m guessing education is a proxy for social class.
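For readers who have never run one, a logistic regression is not magic. Here is a toy single-predictor version fit by gradient ascent in Python – an illustration of the technique on invented data, not the SAS model above:

```python
import math

def fit_logistic(xs, ys, lr=0.1, iters=5000):
    """Fit p(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by gradient ascent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Invented data: age at baseline (centered at 75) and died-in-follow-up
ages = [-10, -8, -5, -2, 0, 2, 5, 8, 10, 12]
died = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(ages, died)
# b1 comes out positive: older at baseline, more likely to die
```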

So, it looks as if Jim is okay to keep smoking his cigars, given he has made it this far. I’m still not convinced about the 8 or 9 beers, though. That’s my next analysis.




After watching Black in America: The New Promised Land, about eight black entrepreneurs trying to get traction in Silicon Valley, I read some of the articles online about it, and some of the comments (always a mistake) – comments which served up further proof that there should be some sort of IQ requirement to use the Internet. (In fairness, this is my response when reading the comments on almost any article.)

One comment that particularly annoyed me regarded the “where are they now” follow-up on the founders. One of the start-up founders featured commented that he wanted to build his application himself, so “I’m learning a new programming language.”

This led to a comment from someone who said this just proved that all these entrepreneurs were unqualified, because true geeks do not need to learn a new programming language: once they know one programming language, they know them all.

My ride to FORTRAN77 course


When I was in college, back when I flew my pterodactyl to school, I had classes in FORTRAN and BASIC. Later, when I was working as an industrial engineer, we all took a class on COBOL for some project that never materialized. If you work in a big company long enough, you will eventually do work for a project that never happens. Next year, it will be 30 years that I have been programming with SAS. I’ve written everything from analyses of data sets with millions of records, to an application that pulled nightly data from one system, ran statistics, created reports by department and output web pages of daily, weekly and monthly results to a different system, to just about everything else you can imagine. I wrote my first computer game, to teach kids math, in BASIC when I was in graduate school back in the mid-1980s, just for a class I was taking.

So, when I read this comment I just could not believe the stupidity of it. Lately, I had been thinking I should learn a new programming language because I have some projects coming up where doing it all with SAS is not the best idea. I started with Ruby and I really liked it. However, the house rocket scientist thought JavaScript might be better for my purposes, and he kept leaving little hints around, like a book on Canvas and another on HTML5 (which was mostly JavaScript), and just dropped a book in my office the other day: JavaScript: The Good Parts.

So far, I really like JavaScript and it has been fairly easy to pick up. I’ve written a couple of simple games, just for practice. As for the idea that anyone who is a “true geek” has a magical decoder ring – that is so stupid as to render me speechless. (But not type-less, as evidenced by this blog.)

Let’s take a simple example of code from Jeanine Meyer’s book, The Essential Guide to HTML5. This is from the middle of a program that just makes a ball bounce around the screen at varying velocity. Not much in itself, but I can certainly see how it could be useful incorporated into a game.
function moveball() {
    ctx.beginPath();
    ctx.arc(ballx, bally, ballrad, 0, Math.PI * 2, true);
    ctx.fill();
    ctx.strokeRect(boxx, boxy, boxwidth, boxheight);
    moveandcheck();
}

function moveandcheck() {
    // Set new X and Y positions
    var nballx = ballx + ballvx;
    var nbally = bally + ballvy;
    if (nballx > boxboundx) {
        ballvx = -ballvx;
        nballx = boxboundx;
    }
    // ... more boundary checks follow in the full program
}

Because I have used other languages, I immediately think of the function as being like creating a macro in SAS. The empty parentheses denote no parameters. I can guess that the { marks the beginning of a function. From both Ruby and just plain geometry, I can guess that the strokeRect statement does something to a rectangle and that those four values define the area of the rectangle. It seems likely that the next statement is calling another function (which, sure enough, I see defined later in the program). The ctx.arc looked to be drawing a circle: x and y are probably the midpoint, and I assume ballrad is the radius. Obviously (from having learned a little bit of Ruby), Math.PI is going to be pi, i.e. 3.14 etc. Without having read Meyer’s book, or an equivalent, I wouldn’t have known what the 0 or true did in that statement.

We can go through the whole snippet of code that way … the var statements are similar to a SAS ATTRIB statement; as in a lot of programming languages, these variables are being declared up front. As with a SAS macro, and many other examples in many other languages, variables defined within a function / macro are local variables. Variables defined outside of the function, earlier in the program (not shown), are global variables … and so on.

All of this is to say that, yes, it is easier to learn a new language if you already know one or two well. However, I never would have known that function bodies in JavaScript begin and end with “{” and “}” unless I had read it in a book, nor that the JavaScript equivalent of SAS’s IF-THEN-DO uses curly brackets as well. I wouldn’t have known any of the specific syntax, like the case keyword, or using Math.PI instead of just PI, and hundreds of other specifics of JavaScript.

Learning a new language is easier the more languages you already know, but it takes time, no matter who you are. Learning the first one takes more time.

I was very disappointed, although, I hate to say, not surprised, by the comments that followed the articles on Black in America. They ran about 10 to 1 saying that there is no discrimination, and that besides, these people were not very good, so that is why they did not get funding; their ideas were awful.

I don’t know that their ideas were any more awful than ideas that have gotten funding and made millions. Take Farmville, for example. I think that is the stupidest idea ever and have never played it, but lots of people I know spend hours on it each week and it has certainly been profitable. I really like Twitter and am on it hours each week. I learn a lot of useful information from many people I follow and am amused by others. Similarly, I have many friends who think that Twitter is the dumbest invention since the mushroom brush. (Can you believe there is actually a site named ? )

Actually, I think the comments reflect the racism that is alive and well in Silicon Valley and America, and it makes me sad. It reminds me of a social psychology study from many years ago, in which subjects were posed a question something like this …

If an African-American candidate was equally qualified for a position, in a department where everyone else on the staff was white, do you think it would ever be acceptable for that candidate to be selected instead of the white candidate?

All of the African-American subjects said, “Yes”.

The white subjects almost all began their answers with,

Well, if the person REALLY was equally qualified ….

IF the person was equally qualified …

The investigator found this curious because the question began by stating that the two had equal qualifications. He said it was as if the subjects had difficulty believing that the black candidate could really be as qualified as the white candidate, even though the researcher had definitely stated that was the case.

I could not find this article again, but if you are more into the social psych literature than I am and can point me to this reference, I would appreciate it. In searching for it, though, I did come across an interesting study on the justification for affirmative action. Those who are wondering whether there is any hard evidence of discrimination in employment might want to read it. Short answer: yes.

From personal experience, I can tell you that over many years, on many occasions, including a few times in the past year, I have been in meetings discussing some code that I had written, some analyses I had done, when someone would turn to a junior colleague and ask him or her questions, assuming the work had been done by anyone other than the Latina grandmother sitting at the head of the table. The last time this happened, I brought it up to an executive from the organization who had been in the same meeting. She said,

Well, you can’t really blame them for assuming someone else had done the programming. Everyone in the meeting was Asian or Indian except for you.

I had no response because my jaw was still hanging open when she walked away.

Do I have an answer? Maybe a little bit of one. Yes, there is prejudice and discrimination. You can still be successful. DO learn new languages. DON’T believe people who tell you that you are “not a true geek” or that “your start-up is a stupid idea”. DON’T let any idiot tell you you’re not successful or competent because you’re not the next Mark Zuckerberg or Marc Andreessen. Neither are they.

As my third daughter likes to say in response to comments putting her down,

Okay, random person on the Internet typing this from his mother’s basement, maybe you don’t think I can be best in the world, but I’m not going to let that stop me.



Before the semester began, I debated requiring SAS On-demand for my statistics course. In fact, after giving it some thought, I decided to make its use optional rather than mandatory. One reason for my hesitation was uncertainty about basing a major part of students’ grades on a project requiring an untested software package. I could see the possibility for disaster. Although it took a good bit of my time to prepare, that was NOT a major issue for me. When I was a full-time faculty member, I was constantly frustrated by the feeling that I did not have enough time to do the best possible job for each of my classes and students. Now, by choice, I teach only once or twice a year, at most.

No, my other concern was that I might be requiring too much of the students. Many of them had never had a statistics course before this class. Out of 19 students, at most two or three had as much as one semester of calculus. I breezed through descriptive statistics; covered correlation, ANOVA, multiple regression and logistic regression in depth; and touched on mixed models, survival analysis, factor analysis and a tiny bit of structural equation modeling and hierarchical models. On TOP of all that, they were going to have to learn at least enough about SAS to run analyses on actual data and give a conference-type presentation. It has been over a quarter of a century since I took my first graduate course in statistics. (That was back when people went to graduate school in their twenties and that was all they did. I know that seems quaint to you all today.) Maybe this is going to make me sound like one of those old fogeys who claim to have walked to school in the snow, uphill both ways.

Still, the truth is that graduate school has become watered down over the past few decades. Professors are supposed to understand that students “have to work” and are not expected to give as much in the way of assignments, so as not to unduly burden the paying customers – er, students. Students at many schools are either subtly or openly encouraged to “go hire a statistician” to help with their dissertations.

Honestly and truly, when I was in graduate school I did not even have A CALCULATOR to do the homework problems in statistics, because calculators were very expensive and I started graduate school with a preschooler and had two more babies in my first two years. As my advisor grumpily said to me,

“Listen, I’m Catholic, too, but there’s such a thing as taking it to extremes.”

I was an industrial engineer programming with SAS before I went back for a Ph.D., so I would actually telnet (remember telnet?) into the university server and run SAS programs to check my statistics homework problems, because graduate students got X number of hours on the computer for free (remember PAYING for computer time?).

I thought about it for a while and concluded,

“Screw it! These are doctoral students at a selective university. They’re getting a doctorate, and they’re smart. They ought to be able to do the work and they WILL learn something and get their money’s worth out of this course, whether they want to or not.”

As I said, this experiment could have turned out to be a disaster in many ways. The software could have not worked. The students could have complained to the administration about the work load. The administration could have told me I was being unreasonable.

It turned out that the software did require some advance preparation and extra work during the course. A lot of pre-processing of the open data sets was done by me. However, by the third week of the course, the students had split into five groups for their research projects, and every group had at least one person with SAS On-demand installed on a computer. Four of the five groups ended up using SAS On-demand for their research. I strongly encouraged them to submit their research for presentation at the Western Users of SAS Software conference in September or the Los Angeles Basin SAS Users Group this summer.

It also turned out that the students really DID want to get their money’s worth out of their courses. I heard from several of them off the record, and let me just say that really bright students know whether they are being challenged or just passed through with their tuition checks cashed – and they appreciated the former. This may well be because they are all working adults who can see the possibility of applying SAS skills and statistical analysis in a work setting, and who also see the competitive environment for employment right now.

The administration seems happy enough. I still get my checks direct-deposited and still get invited to departmental parties, so I guess that is a good sign.

It WAS a happy adventure and the main reason, as I stated in an earlier post, is because of the kinds of research my students were able to do.

Let me just give you one example of what came out of this semester:

One group was interested in testing the hypothesis:

Are African-American women less likely to get married if they have more education?

The group members – three women, two of them African-American – thought that the answer was “Yes”. Their first reason was that some men might be intimidated by women with more education, and that statistics showed African-American women were obtaining degrees at a higher rate than African-American men. Also, they thought women who had degrees might be less willing to “settle”; they wouldn’t feel like they had to get married, so they would be more likely to stay single.

My husband, the real live rocket scientist (now retired as of a few weeks ago), disagreed vehemently when I told him about this. Here is evidence you are doing interesting research: your professor discusses it at home and it leads to debates. He said the group was looking at it from a woman’s perspective. As a man, he wanted a woman with an education, someone who was not boring and not looking at him as a meal ticket. The women in the group were considering the female perspective – women with more education may feel less need to get married – but he thought they failed to consider the male perspective, which is that more men might WANT to marry them if they had more education.

So…. how did it all come out?

[Bar chart of percent married by education]

As you can see, he was right. Of the women aged 18 to 65 years, 43.9% of those with a graduate degree were married, versus 39% of those with less than a high school diploma.

Age is a confound here, however. Education has been rising for African-American women over the past few decades, so older women are more likely to be married (having had more years in which to get married) and less likely to have obtained higher education.

So, the group conducted a logistic regression with marriage as the dependent variable and education, age and wages earned in the past year as the independent variables. They found that education was still a positive factor in predicting marriage after controlling for age and income.

The students used education as both a categorical variable and a continuous variable and found the same result.

Because this made me curious, I re-analyzed the data several ways, using women from age 16 and up, then age 18 and older. (Seriously, this is California. Who gets married at 16?) I looked both at currently married (yes/no) and ever married (yes/no), and education was still significantly (p < .0001) positively related to marriage. Earned income also consistently showed a positive relationship with the probability of marriage.

So, we have scientific evidence – men like smart women. Successful ones, too.

This is just one example of four groups that presented using SAS On-demand. I’ll try to get around to discussing others later.

Hopefully, you’ll see them at WUSS. Hey, if you’re one of those men seeking smart women, you should look them up – I know at least two of them are still single!






It is not every day that my refrigerator provides insight into a statistical problem. My daughter gave me a magnet, which led to my thoughts on life expectancy using open data. Kaiser Permanente collected data on two cohorts of patients: those who were 65 years old or older in 1971, and those who were 65 or older in 1980. After having published some research supported by the National Institute on Aging, on topics of interest to themselves and their funding agency, like cardiovascular disease, the investigators made their data available through ICPSR, where I downloaded it.

I had read elsewhere that life expectancy had increased even within that decade. If that is true, I reasoned, the survival curves for people from the 1971 cohort (1) and the 1980 cohort (2) should show some differences.







When I look at the survival curves by strata:

PROC LIFETEST DATA = saslib.death ;
   STRATA cohort ;
   TIME yrslived * dthflag(0) ;
RUN ;

I get the survival curves above, and it is pretty clear they are the same. If anything, it looks like Cohort 2, those born later, actually had slightly higher mortality near the end of the study. For those of you who feel uncomfortable just eyeballing the curves, even when they are as close to identical as these, the log-rank test for equality over strata = 1.27 (p > .25).
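For the curious, the survival curves PROC LIFETEST draws are Kaplan-Meier product-limit estimates, which you can compute by hand. A bare-bones sketch in Python, on five made-up subjects – my own illustration, not SAS’s implementation:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit survival estimate.

    times  -- years of follow-up for each subject
    events -- 1 if the subject died, 0 if censored (alive at last contact)
    """
    survival = 1.0
    curve = []
    at_risk = len(times)
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        survival *= (at_risk - deaths) / at_risk
        curve.append((t, survival))
        # Deaths and censored subjects both leave the risk set
        at_risk -= sum(1 for ti in times if ti == t)
    return curve

# Deaths at years 2, 4 and 5; censored observations at years 3 and 5
curve = kaplan_meier([2, 3, 4, 5, 5], [1, 0, 1, 1, 0])
# Survival drops to 4/5 at year 2, stays flat at year 3 (censoring alone
# does not drop the curve), then falls again at years 4 and 5
```

Comparing two such curves, stratum by stratum, is what the log-rank test above formalizes.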

And yet, when I did a t-test by age, I found that the 1980 cohort did live significantly longer: those who turned 65 in 1971 had a mean age at death of 84.7, while those who turned 65 in 1980 had a mean age at death of 85.3 (p = .01), which leads to the conclusion that people from the 1980 cohort did live longer.
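That comparison is a simple two-group t-test. Here is a minimal sketch, assuming the same saslib.death data set from the PROC LIFETEST step, with yrslived as age at death and cohort as the grouping variable:

```
* Compare mean age at death between the 1971 and 1980 cohorts ;
PROC TTEST DATA = saslib.death ;
   CLASS cohort ;     * 1 = 1971 cohort, 2 = 1980 cohort ;
   VAR yrslived ;     * years lived, i.e., age at death ;
RUN ;
```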

What’s going on here? This is the point where the getting to know your data part that I am always harping on comes into play. Note that Kaiser-Permanente said that they collected data on people who were 65 or older  in 1971 or 1980, not who turned 65.

In fact, the two samples were not the exact same age. The mean age of the 1971 sample was 75.7 and of the 1980 sample, 76.1. So, of that .6-year difference in lifespan, .4 of it existed before the study even started.

What difference would that make? Well, let’s go back up to my refrigerator magnet. What does the fact that someone has lived to 65 tell you? Most unequivocally, that they did not die of anything at an age earlier than 65. They weren’t killed in the Vietnam War when they were 19 years old, in a car crash with a drunk driver when they were 33, or from colon cancer at 56. Because 100% of the population of people who live to 65 have escaped these hazards for the first 65 years of life, they are NOT representative of the population in terms of life expectancy. This is why when you read articles they have statements like, “For an American male who has lived to age 65, life expectancy is ….”

The qualifying phrase is necessary because those who have had more birthdays already are expected to live longer than the general population.

So, I pulled out those who were 65 when the study started and looked at the survival curve.


Survival curve for subjects who were 65 at the start of the study: Cohort 1 (1971) and Cohort 2 (1980)

A t-test of years lived for those in Cohort 1 versus Cohort 2, using only the 290 subjects who were 65 years old at the start of the study, produced a very non-significant t-value of .56 (p > .50).
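The restriction to subjects who entered the study at 65 is a one-line WHERE statement added to the earlier PROC LIFETEST step. In this sketch, age_start is a hypothetical name for the baseline-age variable; the actual name in the ICPSR file may differ:

```
* Survival curves for only those who were 65 at the start of the study ;
PROC LIFETEST DATA = saslib.death ;
   WHERE age_start = 65 ;   * age_start is a hypothetical variable name ;
   STRATA cohort ;
   TIME yrslived * dthflag(0) ;
RUN ;
```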

T-tests for subjects at age 75 and age 85 produced similar results.  So, based on these data, the answer to the question of at least whether patients of Kaiser Permanente have increased in life expectancy over the 1970s is, “No”. This isn’t a comment on Kaiser Permanente one way or the other, merely an observation that it is unlikely that their patients are completely representative of the population.
Just an aside, a million points to people who put their data on the web and open to all comers. This shows two traits I admire. The first is generosity, allowing someone else to benefit from your efforts in collecting the data, with no expectation of return. The second is courage. It takes a good amount of courage to publish your results and then make your data available so anyone who wants can re-analyze the data and perhaps come up with a competing conclusion. So, props to you.

P.S. You can buy the hand soaps on etsy. I have no affiliation with them, I just thought they were funny.




I watched Black in America: Silicon Valley, The New Promised Land with Soledad O’Brien on the accelerator for black entrepreneurs. It was a terrific show and I highly recommend it. An interesting comment was made by one of the venture capitalists when O’Brien asked him why a white, wealthy male like him cares about diversity in Silicon Valley. He answered that there were huge opportunities in technology as well as huge global competition. In short, if we don’t take advantage of all of the talent we can in this country then some other global player that does is going to come along and eat our lunch.


It reminded me of an article I read about Bill Gates. He was speaking in Saudi Arabia to an audience that was four-fifths male and segregated, with the women in the audience divided from the men by a partition. A participant asked him how Saudi Arabia could become a major player in the technology field. He responded that as long as they continued to under-utilize half the talent in the country, they were never going to make it into the top ten.

This fits with a pattern I have noticed. After all, I’m a statistician and finding patterns in data is what I do. The countries that are the most privileged in terms of education, life expectancy, gross domestic product and median income are all countries with a high degree of gender equality. I think this makes perfect sense, for the exact reason Gates noted in his speech. The only countries with a high degree of gender inequality that are not extremely poor have oil. Eventually, there will be less of a demand for oil. Everything from Segways to hybrid and electric cars to concerns about global warming to a change in emphasis on buying useless crap points to a decline in demand. Anyone who thinks that demand and prices will remain high forever is encouraged to look at the cycles over time for housing, steel or coal. I haven’t studied history a lot, but enough to know that predictions of permanency are nearly always wrong.

But I digress, from my point which was – yes, what was my point – oh yes, that at least some people with deep pockets in Silicon Valley have expressed a belief that under-utilizing a substantial percentage of your population leads to a disadvantage in terms of competitiveness. Further, the evidence, at least on an international scale, does seem to point in that direction.

If start-ups really are an exclusively young, white or Asian male, Ivy League phenomenon then we should be concerned and try to expand that pool. That’s what the NewMe accelerator featured in the Black in America special tried to do.

The show and the idea of working at an accelerator really fascinated me. The house rocket scientist and I fit the prototype of a successful start-up team in many ways. We can afford to spend full time working on a product at no pay (not, like most founders, because we are young and have no expenses, but because we are older and have paid off enough and saved up enough that we could get by without working for years, maybe forever). We have years of experience programming. We have degrees in technical fields. According to the Start-up Genome Project, start-ups are more likely to succeed if they have two founders (check). We have one founder who is more at the business end, another more at the technical end. We are more “technical-heavy”, though, and we are more product-centric, which their data suggest is more likely to be successful for teams like ours. Unlike a lot of younger founders, we don’t need co-working space. We have a seven-computer LAN in our house, with two desktops with 1 TB each of internal storage, two 2 TB backup drives, an off-site server for shared storage and backup to facilitate working with partners around the country, printers, copiers, fax machines, and Linux, Windows and Mac OS operating systems, with both the most current OS for development and older ones for testing. Basically, whatever software and hardware we would need, we have it. We have a stellar accountant to take care of taxes and payroll, and willing test sites for alpha and beta testing. I could go on; it’s a long list.

There are no doubt business accelerators that would find us a perfect fit, and I have considered applying to several. So, what is stopping us, if it is not technical skills, money or the ability to take time to devote to the project and travel?

This … one of the people in the NewMe accelerator was a single mother of three children. She did not say who was watching them while she was in Silicon Valley. I presume a family member. Another founder had a seven-month-old son whom he had left with his wife. No matter which way I turned it, no matter what excuses I made and how I rationalized it, it came back to the question, “Would I leave my daughter for nine weeks?”

and the answer was a resounding, “No.”

Yes, I know that people in the military leave their children for months on end. My father was in the Air Force. I know that some people have to travel for business and are gone for long periods. While I travel a lot, the longest I have ever been away from my daughter is a little over two weeks.

The rocket scientist and I talked it over. We understand perfectly that there is the potential to make a lot of money and do exciting work. We could justify it by saying that if we are successful it will give us the opportunity to provide our daughter with things she would not otherwise have. (While that is true, she is already the world’s most spoiled 13-year-old. True, she does not own a Porsche, a condominium in Malibu or the entire Brandy Melville line, but none of that would really improve her life, regardless of what she tells you.) The truth is that our daughter has everything she needs and 95.16% of what she wants. We could say that it would teach her about sacrifice, work ethic, pursuing a goal, motivation. Both of us came back with the same answer, an unequivocal, “No.”

I read a story about a basketball player who never played on Sunday because he never skipped church. His college team made the playoffs and their final game was on a Sunday. Although his teammates knew he did not play on Sundays, they could not believe that he would not skip church, “Just this once.” The player decided not to play in the final game. Many years later, he had no regrets. He said that he had found it was much easier to hold to your convictions 100% of the time than 98% of the time.

We believe our daughter is more important than any amount of money or career accolades we might receive and that being away from us for nine weeks would not be best for her.

Why should anyone care? As many of the commenters on many of the articles on the NewMe accelerator said, to hell with anyone who doesn’t fit the mold. If you don’t want to eat ramen and live eight to a house in Silicon Valley when you are young, it’s your loss; that’s what you need to do to play the game.


Or maybe there needs to be a wider range of accelerator options and incentives.


Because I agree with Bill Gates and with the VC investor in Black in America, our country needs new start-ups, new jobs, new technology far more than I need to make enough money to buy Sweetpea a Porsche.




Ask me anything: Part 2

December 9, 2011 | 1 Comment

Continuing on with questions students asked at the end of the semester …

Note that the following questions were asking what I personally do, and I answered the same way. These are not hard-and-fast rules that everyone has to follow, like taking the square root of the variance to find the standard deviation, but they are, I think, generally good advice.

Are there tests/tasks you always do when analyzing data?

Yes. If using SAS Enterprise Guide, I always run the Characterize Data task. If using another software package, I always run descriptive statistics (DESCRIPTIVES in SPSS; PROC CONTENTS, PROC MEANS and PROC FREQ in SAS; or DESCRIBE, SUMMARIZE and TABULATE in Stata).

The reason I do this is that I want to get a good look at my data and make sure there are no real problems. If the data are no good, there isn’t much point in going any further without fixing them.
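As a sketch of that first pass in SAS (the data set name mydata.survey and the variable names here are hypothetical, just to show the shape of it):

```
* First pass: what is in the data set, and do the values make sense? ;
PROC CONTENTS DATA = mydata.survey ;
RUN ;

* Means, standard deviations and ranges for numeric variables ;
PROC MEANS DATA = mydata.survey N MEAN STD MIN MAX ;
RUN ;

* Frequency tables for categorical variables ;
PROC FREQ DATA = mydata.survey ;
   TABLES gender race ;
RUN ;
```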

Is there a specific order to the tests you perform when analyzing data ?

Yes. I do descriptive statistics first to check for data quality, do a reality check (see that all of my subjects aren’t the same age, “rutabaga” is not a value entered for race and so on). I also do descriptive statistics to get an idea of what type of sample these data represent. Are they middle school students, older adults, a random sample of the population of California?

Next, I do bivariate statistics, both descriptive and inferential. I want to take a look at the simple relationships first. That way if later on I notice that there is a relationship between living near the coast and SAT scores, before I go off on a tangent about the salt air developing brain cells I am also aware that mean housing prices are significantly higher by the beach, and I consider the possibility that it might just be the same old correlation of SES and academic achievement reported thousands of times over.


Trust is a great basis for a relationship. For research, I want data.

Next, if I am going to use any measures, say a scale of attitudes toward innovation, I compute at a bare minimum internal consistency reliability. I say as a bare minimum because you should always be able to get that. All you need are the answers to the individual items. If I don’t have those individual items, I am going to be very uncomfortable using the scale because I am just going on trust that the items were coded correctly and the scale was scored correctly. Trust is a good basis for a personal relationship but I don’t like to base my results on it.
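Internal consistency reliability needs only those item-level answers. A minimal sketch, assuming ten hypothetical items named innov1 through innov10 in a hypothetical data set:

```
* Cronbach's coefficient alpha for a ten-item innovation attitudes scale ;
PROC CORR DATA = mydata.survey ALPHA NOMISS ;
   VAR innov1 - innov10 ;
RUN ;
```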

Finally, if appropriate, I would do any multiple regression, logistic regression, survival analysis or other technique using multiple predictor variables.

How do you select what statistics you want to report?

First off, ten points extra credit for knowing that I always run far more statistics than I report. There are two reasons I do this.

First, I run statistics to assess data quality and the representativeness of the sample, which are not of much interest to the reader. Even if the documentation for a data set swears up and down, with pictures of angels, that the sample was randomly selected from the population, I am going to at least take a look at whatever sample demographics I have available to determine if it is evenly split by gender, has a distribution by race that approximates the population, and so on for any other variables I can imagine. Sometimes, a variable like “whether or not there is a mother in the home” may show up in unexpected ways. More often, everything is exactly as expected and I don’t necessarily report that “the American Community Survey is a representative sample of the state of California, just like the Census Bureau claimed”. So, I generally don’t report statistics measuring the quality of the data and sample. If they are positive there’s nothing of interest to say, and if negative I need to fix the problem or not proceed with the study.

Second, I run a lot of models and look for convergence. I may run a PROC GLM and a PROC MIXED, once including school as a random effect and once not at all. I may run a logistic regression where I split campuses into “reported more than ten crimes” and “reported ten or fewer crimes” and use that as the dependent variable rather than the number of crimes. I’m doing these multiple analyses primarily for me. I want to understand what is going on here, whether major violations of the normality assumption are impacting the results, and what would happen if I split the data a different way, such as zero crimes versus one or more crimes. What I am NOT doing is running 100 analyses and reporting the five that are significant. On the contrary, I am looking at the same question four or five related ways and expecting all or nearly all to be significant. If every one doesn’t come out significant, I want to know why. The decision to accept or reject the null hypothesis should not be based on using exactly THAT measure and coding the analysis using precisely THESE options. That, my friend, is not a very robust finding. It reminds me of a comment Dr. Donald MacMillan made when someone asked him what test was best to determine if a child was mentally retarded. He said,

“If you have to give a test to know whether he’s mentally retarded or not, odds are, he isn’t.”

What then do I report?

  1. Sample demographics
  2. Descriptive statistics for all independent and dependent variables used in the analyses
  3. Test reliability and validity data if the measures are not standard ones. (I would not report reliability for the SAT, but for Joe’s Test of Academic Achievement, I would.)
  4. Inferential statistics (assuming I ran any, which I usually did)

So, what about inferential statistics, which ones do I report? I think I am in the minority here because I generally choose more familiar, easier to understand statistics. If I had done a regression and a mixed model and they gave me essentially the same results, I’d report the regression. If I did a chi-square with one independent variable and correctly classified 80% of the tofu-eaters and my logistic regression with three variables, all of them significant, correctly predicted 82% of those with a predilection for tofu, I’d go with the chi-square. If my structural equation model didn’t really add a lot compared to doing three multiple regressions, I’d report the regressions. My main objective is to inform my clients, not prove how smart I am.

For me, a technique that is less accessible to the general public is going to have to have some substantial improvement in prediction or clarity (for example, documenting interaction effects) to justify its use over a more basic technique.


So, those are my answers for the day. Tune in next time for a completely unrelated post on why Black in America was awesome, accelerators sound exciting and I don’t think I could ever do one.






Ask Me Anything: Part I

December 6, 2011 | 2 Comments

It’s that time of year, near the end of the semester, when I ask students to write down any questions they may have about material covered in the course. This semester I am teaching Advanced Quantitative Data Analysis. I thought other people might be interested in the answers to the questions students asked as well.

Is it possible to have a Mean Square of 0 ?

Yes, it is possible, but if your model mean square is exactly zero then it is extremely likely that something is wrong with your model, in addition to the obvious problem that your mean square is zero. Normally, even if you select a random number as a predictor or dependent variable, you’ll get some very small model mean square value; it won’t be EXACTLY zero. One instance in which you will get a zero model mean square is when your dependent variable is a constant.
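You can see the constant-dependent-variable case for yourself with a few lines of made-up data (this data set exists only for illustration). Since the dependent variable never varies, the total sum of squares is zero, so the model mean square must be zero too:

```
* A made-up data set where the dependent variable is a constant ;
DATA constdv ;
   DO i = 1 TO 20 ;
      x = RANUNI(123) ;   * random predictor ;
      y = 5 ;             * dependent variable is a constant ;
      OUTPUT ;
   END ;
RUN ;

* The ANOVA table will show a model mean square of exactly zero ;
PROC GLM DATA = constdv ;
   MODEL y = x ;
RUN ;
QUIT ;
```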

Another case in which you will get a zero model mean square is if you are using SAS Enterprise Guide, your dependent variable is standardized with a mean of zero, and you forget to specify effects for your model. If you fail to specify effects, SAS EG fits an intercept-only model, and since the intercept is zero for a standardized dependent variable, you will get a model mean square of zero.


I’d say if your model mean square is exactly zero something fishy is up.

Is there a limit to the number of variables to be used for a multiple regression?

If your question is whether there is an absolute number, like no more than 42, the answer is,

“No, there is no set maximum number of variables that can be used for a multiple regression.”

Obviously, you need to have at least two predictor variables or it is not a multiple regression. You cannot have more predictors than there are observations, because then you cannot find a unique solution. If you have 30 subjects you can’t have 42 independent variables. On the other hand, if you have 3,000,000 subjects, 42 independent variables would work mathematically. It might be a difficult analysis to interpret, though, and you might run into problems with multicollinearity.

When doing a two-way ANOVA with SAS Enterprise Guide, does it matter if I drag a dependent variable under classification variables?

You drag INDEPENDENT variables under classification variables. You would never put your dependent variable there.

Can type I sum of squares and type III sum have equal values sometimes?

They can and do. The Type I sum of squares is also called the sequential sum of squares. The sum of squares for each term is given controlling for the terms that precede it in the model. Some people don’t like the Type I sum of squares because the SS for a variable can change if you change its order in the model. Let’s say we are predicting SAT scores based on SES (upper, middle, lower) and school type (private or public). If we look at the effect of school type with and without controlling for SES, we may get very different sums of squares.

The Type III Sum of Squares is also called the marginal sum of squares. It gives you the sum of squares controlling for all of the other effects in the model.

If you think about it for a minute, you’ll see that the Type I and Type III sums of squares are always the same for the last term entered in the model; in a two-way ANOVA, this is usually the interaction effect. For the last term, in Type I you get the sum of squares controlling for all of the effects entered previously in your MODEL statement, which is all of the other effects. In the Type III sum of squares, you get the sum of squares controlling for all of the other effects in the model for every effect, including the last one.

So … if your MODEL statement is

MODEL gpa = ses school ses * school ;

The Type I and Type III for ses * school will be identical.
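You can check this yourself by running the full two-way model in PROC GLM, which prints both the Type I and Type III tables by default (the data set name here is hypothetical):

```
* Two-way ANOVA; compare the Type I and Type III SS tables in the output ;
PROC GLM DATA = mydata.schools ;
   CLASS ses school ;
   MODEL gpa = ses school ses * school ;
RUN ;
QUIT ;
```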

If you leave off your interaction term and have

MODEL gpa = ses school ;

The Type I and Type III for school will be identical.

You will also get the same values for Type I and Type III for effects in the model other than the last one if there is zero shared variance. Just like with getting exactly a zero mean square, though, that is unlikely to happen. Even when I tried a couple of times just creating categories from a random number function

IF RANUNI(4) > .42 THEN rand2col = 912 ;
ELSE rand2col = 875 ;

The Type I and Type III sums of squares were not EXACTLY the same. The Type III always differed just a tiny bit from the Type I, due to a minute amount of shared variance being present just by chance. However, the Type I and Type III were the same for all practical purposes. For example, in one case I used attitudes toward abortion (a scale) as the dependent variable, with church type (fundamentalist or non-fundamentalist) and my random categories as the independent variables.

The Type I SS for church type was 31.34 (F = 8.56, p < .008) and the Type III SS was 33.0 (F = 9.01, p < .007).

The fact that I have time to create random categories in an attempt to find exactly identical Type I and Type III SS  is perhaps evidence that I need to find a hobby. Maybe knitting?










Anyone who has ever taught has probably had this experience … you are reading a student paper and think,

“I really don’t understand what he/ she is trying to say here.”

Like most professors, you probably assume that in these situations, the students aren’t very clear on the concepts themselves. You’re probably right on that, too. If you spent enough years in school, you probably caught yourself at least once facing a question you had no idea what it was about, so you tried to just write down whatever you did know and hoped the professor was kind enough to find something in there somewhere to give you points. Sometimes this works.

The odd thing I’ve found, though, is that when it is the professor who is saying something that is not understood, it is again assumed that it is the students who do not really understand. I would say that assumption is questionable.

When you’re presenting, whether it be giving a lecture to a class, a conference presentation or a meeting, and the majority of your audience is confused, there are a few possibilities:

  1. You don’t understand the material.
  2. You don’t understand how to communicate your knowledge.
  3. You did not spend enough time preparing for your presentation.
  4. Your audience is not very bright.
  5. Your audience did not spend enough time preparing for your presentation.

Out of those possibilities, 60% are you. One of my pet peeves, which I see everywhere from high school classrooms to prestigious conferences, is people who seem to disrespect their audience. I often hear colleagues say that they don’t practice their presentations,

“I do better when I’m more spontaneous, otherwise it just sounds like I’m reading my notes.”

Or some bullshit like that. No. You don’t. It is damn near impossible to always accurately estimate how long it will take you to give a talk. If you are teaching a class, that isn’t a big problem; you can pick up where you left off the next time. In a meeting or conference, where people had to juggle schedules to come together, that isn’t always possible. Sometimes, when your audience doesn’t understand, it is because you were rushed and had to leave out a lot of information they needed. Sometimes you assumed they had prerequisite information and they didn’t.

How to fix that: I put my PowerPoint up before every class. Students can read it ahead of time. I also often put up supplemental materials so that those students who have trouble can go read that prerequisite information. Conferences often have an option to post papers after the conference that go into more detail. I email my PowerPoint to anyone who asks. I put presentations up on our company website when I have time, and I end many presentations with references for additional information. I actually read my presentation out loud before a meeting and time it. Then I allow a few minutes for introductions, people getting settled in, etc. For great tips on giving good presentations, I highly recommend the book Rockstar Presentations.
Then there are the first two problems, where I have noted a recent, disturbing trend.  I’ve been a statistical consultant for nearly thirty years and I have had very, very few clients who were students who wanted help with their course work or dissertation. Our rates are not cheap – $100 an hour if I don’t have to leave my house. If I have to put on a suit, go anywhere further than my office downstairs or get up before 10 a.m., the rates go up sharply.

For consulting work, it is a good deal but it is not in the budget of most students. Yet, I am seeing more and more who are willing to pay it.  When I meet with the students and review the material from their courses, their assignments and the feedback they are receiving from their professors, I see why.
First, there are the professors who don’t really know all that much. I attended a university that required a minimum of four courses in statistics or research methods for every doctoral student. Three of those four courses also included a three-hour weekly computer lab.  Students who specialized in statistics took another five or six courses. One of those had a computer lab. After that, you were expected to do it on your own. Many of the programs producing Ph.D.s now require one, or at most, two, courses and no computer labs. The students might go to the lab two or three times during their program. That’s it.  Students receiving Ph.D.’s in statistics and mathematics tend to focus much more on theory than programming and very, very little on application to real problems.
These professors may know a lot about a lot of things. Unfortunately, one of those things is not what they are teaching, how to apply statistics, both conceptually and using statistical software, to conduct research. How do they publish? Most of them don’t. Some publish articles on topics that require little or no statistics, publish terrible articles in lower-tier journals or pay people like me to do the statistical analysis section for them. Just like the students, they have a hard time articulating ideas in statistics because they really don’t understand very well themselves what a planned orthogonal contrast, dummy variable or logit actually is.
Second are the professors who don’t know how to communicate the information. Sometimes it is because English is their second language. This has been a problem since I was a student and I frankly do not understand it. If part of the job requirement is to communicate information in English, why hire people who cannot speak English fluently? Why don’t the students object? Actually, they’ve been objecting since I was an undergraduate and no administration ever seems to give a damn. I don’t get it. There are also professors for whom English is the only language they speak, and they still don’t communicate in it very well. Sometimes I think this problem is just plain arrogance; they don’t think it is worth their time to bother with communicating well. This has been explained to me as everything from the university putting much more emphasis on research (often true) to the students just not being “college material” and needing to drop out or pick an easier major where said professor doesn’t have to be bothered with them.
While the students are certainly pleasant and interesting to work with, and I don’t mind the additional business, I remain puzzled by the whole picture.
Why do students put up with paying $40,000 a year for an education that leaves them needing to pay me $100 an hour to explain what their professor said in class?  Why do universities hire people who don’t really know their subject very well or who are not fluent speakers of the language they are supposed to be using to teach? Why don’t more universities go to the model of hiring people who can and want to teach as clinical or teaching professors and hiring the people who don’t want to be bothered with teaching as researchers?
These aren’t rhetorical questions. I don’t have the answers and I really am interested, so if anyone knows, please enlighten me.