# Cigar-smoking won’t kill you if you’re already old

December 27, 2011 | 1 Comment

In my analysis of data on the oldest old from the Kaiser Permanente study, I started with cigar smoking because I have a friend who turned 65 last month. His annual physical went like this:

Doctor: Do you still smoke cigars?

Jim: Yep.

Doctor: Do you still drink 8 or 9 beers every night?

Jim: Yep.

Doctor: Are you going to quit?

Jim: Nope.

Doctor: Well, see you next year.

You know how old people always tell you that it doesn’t matter if they quit now, because they’re too old to die young anyway? I got to wondering whether that was true, that is, whether people who were still smoking at age 65 were any more likely to die than those who weren’t. (We’ll save the 8 or 9 beers thing for later.)

My first analysis was to look at the data by gender, because I suspected that men were more likely to smoke cigars and I knew that women have a longer life expectancy. When I ran a table analysis of gender by cigar smoking using SAS Enterprise Guide, I found that only one of the 288 cigar smokers in the sample was female. So it was clear to me that if I was looking at the impact of cigar smoking on mortality, I should probably only look at men.

The average age at death for men who smoked cigars (N=175) was 84.2 years, compared to 84.2 years for men who did not smoke cigars (N=1,147). Since there is some rounding here, p does not equal exactly 1.0, but it is greater than .90. So it does not seem that quitting cigar smoking would improve your life expectancy, given that you have already made it to age 65.

However, that counts only people who died. Perhaps not smoking will increase your probability of survival? Let’s run a simple chi-square analysis comparing men who smoked cigars to those who did not, and see what the probability was that they died over the nine years they were followed.

Of the cigar smokers, 13.2% died during the nine years participants were followed, compared to 13.1% of men who did not smoke cigars (chi-square = .07, p > .78).
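For readers who want to see the arithmetic behind a result like that, a chi-square for a 2×2 table can be computed in a few lines. This is a plain-Python sketch, not the SAS output, and the counts in it are hypothetical stand-ins chosen to give death rates near 13%; the actual cell counts are not reproduced here.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table laid out as:
                 died   survived
    cigars        a        b
    no cigars     c        d
    """
    n = a + b + c + d
    # Shortcut formula, algebraically equivalent to summing
    # (observed - expected)^2 / expected over all four cells.
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts with death rates of about 13.2% vs. 13.1%:
stat = chi_square_2x2(38, 250, 150, 997)
print(round(stat, 2))  # nearly identical rates give a chi-square near zero
```

A chi-square this small is nowhere near any conventional significance cutoff, which is the whole point of the comparison above.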

Still, I am not convinced. Maybe cigar smokers were older, or less likely to be overweight or to drink (well, I doubt that one). Anyway, being the good little statistician, I ran a logistic regression with death as the dependent variable and BMI, education, age at the beginning of the study (baseline), whether or not the subject had ever drunk alcohol, and whether or not he smoked cigars as the predictors.

In the resulting model, the only two significant factors were age (older people were more likely to die in the next nine years, not surprising since the study began with people 65 and over) and education. I’m guessing education is a proxy for social class.

So, it looks as if Jim is okay to keep smoking his cigars, given he has made it this far. I’m still not convinced about the 8 or 9 beers, though. That’s my next analysis.

# Travels through Open Data Land, with old people, flashlights & cigars

December 27, 2011 | 3 Comments

(Yes, that title does sound like a lot of the spam comments I get. )

Last year, at the Gov 2.0 conference in Santa Monica, Jean Holm of NASA spoke about some of the opportunities for open data. I left with mixed feelings. On the one hand, the best examples she gave were, I thought, of “semi-open” data, a term I just made up for greater openness of data within an organization. In one example, there would be a database of the capabilities of researchers within NASA, so if I were a NASA electrical engineer with an idea for designing a better electrical system for a lunar module, I could find out who had related expertise in hardware design, reliability testing, etc. That is a great idea, but it also makes me wonder to what extent open data within an organization would be put to use. It depends a lot on the organization. Many large institutions – whether corporate, government, non-profit or educational – are not very excited about people going around the usual chain of command or departmental structure, no matter how many times they chant, “Think outside the box!”

Many times, both on this blog and elsewhere, I’ve questioned the probability of useful discoveries coming from open data unless the individuals doing the analysis have some knowledge of statistics, programming and the structure of the data actually being used. If we have ten thousand people each doing 100 analyses, how do we decide which half of the 100,000 statistically significant results is important, and which half is in the group of 50,000 statistically significant results we would expect to occur by chance with 1,000,000 analyses?

Ten thousand people doing 100 analyses = 1,000,000 results. One of those will have a p-value of .000001 just by chance. A one-in-a-million coincidence happens one time out of a million, right? So, let’s say we get three of those p < .000001 events. How do we know which ones matter?
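That arithmetic is easy to check by simulation. Here is a plain-Python sketch (not part of the original analysis) that runs 10,000 hypothesis tests on pure noise; about 5% come out “significant” at the conventional .05 level, even though every null hypothesis is true by construction.

```python
import math
import random

random.seed(0)

N_TESTS = 10_000   # analysts x analyses, scaled down for speed
N = 50             # observations per test
ALPHA = 0.05

false_positives = 0
for _ in range(N_TESTS):
    # Null data: a sample from N(0, 1), so H0 (mean = 0) is true.
    sample = [random.gauss(0, 1) for _ in range(N)]
    mean = sum(sample) / N
    z = mean * math.sqrt(N)                # z-test with known sigma = 1
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value
    if p < ALPHA:
        false_positives += 1

print(false_positives / N_TESTS)  # hovers around 0.05
```

Scale that false-positive rate up to a million analyses and you get the 50,000 chance findings described above.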

She said, sure, you can have people correlating everything and coming up with nonsensical relationships, like one between the number of flashlights sold and solar flare activity. Presumably, somewhere “out there” are scientists, or consumers of data, or someone, who will be able to separate the real findings from the flashlight sales – solar flares relationships. Having read a lot of academic journal articles that appear to have been both written and edited by people who were not paying attention, inebriated, or both, I am not so optimistic.

So, I decided to do an experiment and see just how far I could get with some samples of open data. The first data set I chose was from the Kaiser-Permanente study of the oldest old. This is actually two data sets.

One of the reasons I chose these data is that they come with pretty comprehensive documentation. For example, after reading through several hundred pages, I knew that the first data set was the master file and the second was a hospitalization file. In my experiment to see if I could find anything useful in here at all (other than what had already been published), I decided to use just the master file.

A second reason for selecting the oldest old study is that there are some published statistics I can verify my results against to see whether I have the data read in and coded correctly from the beginning. For example, the number of deaths I had in the first cohort (1,565) and second cohort (1,751) matched their figures exactly.

I did not start out with any preconceived notions other than those of the general public – assumptions like: the older people were at the beginning of the study, the more likely they were to die before it was over. While I’ve worked on some health statistics studies in the past, I am not a physician. This is one reason I used the master file instead of the hospitalization one. I know what an acute myocardial infarction is, but I could not really generate much in the way of hypotheses about it, nor about the accuracy of diagnosis. On the other hand, dead or not dead is pretty objectively measured.

I started with cigar smoking because of a friend who just turned 65 and has no intention of giving up either his cigars or his 8 or 9 beers a night. The full analysis is in the post above; the short version is that neither mean age at death nor the probability of dying during the nine years of follow-up differed between men who smoked cigars and men who did not, even after controlling for BMI, education, baseline age and drinking.

# My happy adventure with SAS on-demand

Before the semester began, I debated requiring SAS On-Demand for my statistics course. In fact, after giving it some thought, I decided to make its use optional rather than mandatory. One reason for my hesitation was uncertainty about basing a major part of students’ grades on a project requiring an untested software package. I could see the possibility for disaster. Although it took a good bit of my time to prepare, that was NOT a major issue for me. When I was a full-time faculty member, I was constantly frustrated by the feeling that I did not have enough time to do the best possible job for each of my classes and students. Now, by choice, I teach only once or twice a year, at most.

No, my other concern was that I might be requiring too much of the students. Many of them had never had a statistics course before this class. Out of 19 students, at most two or three had as much as one semester of calculus. I breezed through descriptive statistics, covered correlation, ANOVA, multiple regression and logistic regression in depth, and touched on mixed models, survival analysis, factor analysis and a tiny bit of structural equation modeling and hierarchical models. On TOP of all of that, they were going to have to learn at least enough about SAS to run analyses on actual data and give a conference-style presentation. It has been over a quarter of a century since I took my first graduate course in statistics. (That was back when people went to graduate school in their twenties and that was all they did. I know that seems quaint to you all today.) Maybe this is going to make me sound like one of those old fogeys who claim to have walked to school in the snow, uphill both ways.

Still, the truth is that graduate school has become watered down over the past few decades. Professors are supposed to understand that students “have to work” and are not expected to give as much in the way of assignments so as not to unduly burden the paying customers – er, students. Students at many schools are either subtly or openly encouraged to “go hire a statistician” to help with their dissertations.

Honestly and truly, when I was in graduate school, I did not even have A CALCULATOR to do the homework problems in statistics, because calculators were very expensive and I started graduate school with a preschooler and had two more babies in my first two years. As my advisor grumpily said to me,

“Listen, I’m Catholic, too, but there’s such a thing as taking it to extremes.”

I was an industrial engineer programming with SAS before I went back for a Ph.D., so I actually would telnet (remember telnet?) into the university server and run SAS programs to check my statistics homework problems, because graduate students got X number of hours on the computer for free (remember PAYING for computer time?).

I thought about it for a while and concluded,

“Screw it! These are doctoral students at a selective university. They’re getting a doctorate, and they’re smart. They ought to be able to do the work and they WILL learn something and get their money’s worth out of this course, whether they want to or not.”

As I said, this experiment could have turned out to be a disaster in many ways. The software could have not worked. The students could have complained to the administration about the work load. The administration could have told me I was being unreasonable.

It turned out that the software did require some advance preparation and extra work during the course. I did a lot of pre-processing of the open data sets myself. However, by the third week of the course, the students had split into five groups for their research projects, and every group had at least one person with SAS On-Demand installed on a computer. Four out of the five groups ended up using SAS On-Demand for their research. I strongly encouraged them to submit their research for presentation at the Western Users of SAS Software conference in September or the Los Angeles Basin SAS Users Group this summer.

It also turned out that the students really DID want to get their money’s worth out of their courses. I heard from several of them off the record, and let me just say that really bright students know whether they are being challenged or just passed through with their tuition checks cashed, and they appreciated the former. This may well be because they are all working adults who can see the possibility of applying SAS skills and statistical analysis in a work setting, and they also see the competitive environment for employment right now.

The administration seems happy enough. My checks still get direct deposited and I still get invited to departmental parties, so I guess that is a good sign.

It WAS a happy adventure, and the main reason, as I stated in an earlier post, is the kind of research my students were able to do.

Let me just give you one example of what came out of this semester:

One group was interested in the question:

Are African-American women less likely to get married if they have more education?

The group members – three women, two of them African-American – thought that the answer was “Yes.” Their first reason was that some men might be intimidated by women with more education, and that statistics showed African-American women were obtaining degrees at a higher rate than African-American men. Also, they thought women who had degrees might be less willing to “settle”; they wouldn’t feel they had to get married, so they would be more likely to stay single.

My husband, the real-live rocket scientist (retired as of a few weeks ago), disagreed vehemently when I told him about this. Here is evidence you are doing interesting research: your professor discusses it at home and it leads to debates. He said the group members were looking at it only from a woman’s perspective – women with more education may feel less need to get married. What they failed to consider, he argued, was the male perspective: as a man, he wanted a woman with an education, someone who was not boring and not looking at him as a meal ticket, so more men might WANT to marry women with more education.

So…. how did it all come out?

It turned out he was right. Of the women aged 18 – 65 years, 43.9% of those with a graduate degree were married, versus 39% of those with less than a high school diploma.

Age is a confound here, however.  Education has been rising for African-American women over the past few decades, so older women are more likely to be married (having had more years to get married) and less likely to obtain higher education.

So, the group conducted a logistic regression with marriage as the dependent variable and education, age and wages earned in the past year as the independent variables. They found that education was still a positive factor in predicting marriage after controlling for age and income.

The students used education as both a categorical variable and a continuous variable and found the same result.

Because this made me curious, I re-analyzed the data several ways, using women from age 16 and up, then age 18 and older. (Seriously, this is California – who gets married at 16?) I looked both at currently married (yes/no) and ever married (yes/no), and education was still significantly (p < .0001) positively related to marriage. Earned income also consistently showed a positive relationship with the probability of marriage.
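For readers without SAS, here is a rough sketch of the kind of model involved: logistic regression with marriage as the outcome and education as a predictor. Everything below is invented for illustration – the data are synthetic and the coefficients made up; the real analyses used the ACS extract in SAS. The point is only that a positive fitted coefficient on education means more education predicts higher odds of marriage.

```python
import math
import random

random.seed(1)

# Synthetic stand-in for the ACS extract: (years of education, married 0/1),
# generated so that more education truly raises the odds of marriage.
def simulate(n=1000, b0=-1.0, b_educ=0.15):
    rows = []
    for _ in range(n):
        educ = random.uniform(8, 20)
        p = 1 / (1 + math.exp(-(b0 + b_educ * (educ - 14))))
        rows.append((educ, 1 if random.random() < p else 0))
    return rows

def fit_logistic(rows, lr=0.01, steps=1000):
    """Logistic regression of married ~ education, fit by plain gradient ascent."""
    w0 = w1 = 0.0
    n = len(rows)
    for _ in range(steps):
        g0 = g1 = 0.0
        for educ, y in rows:
            x = educ - 14  # center the predictor
            p = 1 / (1 + math.exp(-(w0 + w1 * x)))
            g0 += y - p
            g1 += (y - p) * x
        w0 += lr * g0 / n
        w1 += lr * g1 / n
    return w0, w1

w0, w1 = fit_logistic(simulate())
print(f"education coefficient: {w1:.2f}")  # positive, by construction here
```

In a real analysis the sign and significance of that coefficient, after controlling for age and income, is what carries the students’ finding.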

So, we have scientific evidence – men like smart women. Successful ones, too.

This is just one example of four groups that presented using SAS On-demand. I’ll try to get around to discussing others later.

Hopefully, you’ll see them at WUSS. Hey, if you’re one of those men seeking smart women, you should look them up – I know at least two of them are still single!

# Are we living longer – or not?

December 13, 2011 | 3 Comments

It is not every day that my refrigerator provides insight into a statistical problem. My daughter gave me this magnet.

BIRTHDAYS ARE GOOD FOR YOU. STATISTICS SHOW THAT PEOPLE WHO HAVE THE MOST BIRTHDAYS LIVE THE LONGEST.

which led to my thoughts on life expectancy using open data. Kaiser Permanente collected data on two cohorts of patients: those who were 65 years old or older in 1971, and those who were 65 or older in 1980. After having published some research supported by the National Institute on Aging, on topics of interest to themselves and their funding agency, like cardiovascular disease, the investigators made their data available through ICPSR, where I downloaded it.

I had read elsewhere that life expectancy had increased even within that decade. If that is true, I reasoned, the survival curves for people from the 1971 cohort (1) and the 1980 cohort (2) should show some differences.

When I look at the survival curves by strata

    PROC LIFETEST DATA = saslib.death ;
        STRATA cohort ;
        TIME yrslived*dthflag(0) ;
    RUN ;

I get the survival curves above, and it is pretty clear they are the same. If anything, it looks like Cohort 2, those born later, actually had slightly higher mortality near the end of the study. For those of you who feel uncomfortable just eyeballing the curves, even when they are as close to identical as this, the log-rank test for equality over strata = 1.27 (p > .25).

And yet, when I ran a t-test on age at death by cohort, I found that the 1980 cohort did live significantly longer, with those who turned 65 in 1971 having a mean age at death of 84.7, while those who turned 65 in 1980 had a mean age at death of 85.3 (p = .01), which leads to the conclusion that people from the 1980 cohort did live longer.

What’s going on here? This is the point where the “getting to know your data” part that I am always harping on comes into play. Note that Kaiser Permanente said that they collected data on people who were 65 or older in 1971 or 1980 – not people who turned 65.

In fact, the two samples were not the exact same age. The mean age of the 1971 sample was 75.7, and of the 1980 sample, 76.1. So, of that .6-year difference in lifespan, .4 of it existed before the study even started.

What difference would that make? Well, let’s go back up to my refrigerator magnet. What does the fact that someone has lived to 65 tell you? Most unequivocally, that they did not die of anything at an age earlier than 65. They weren’t killed in the Vietnam War when they were 19 years old, or in a car crash with a drunk driver when they were 33, or by colon cancer at 56. Because 100% of the population of people who live to 65 have escaped these hazards for the first 65 years of life, they are NOT representative of the population in terms of life expectancy. This is why, when you read articles, they have statements like, “For an American male who has lived to age 65, life expectancy is ….”

The qualifying phrase is necessary because those who have had more birthdays already are expected to live longer than the general population.
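The magnet’s logic is easy to demonstrate with a toy simulation. The lifespan distribution below is invented (a bell curve, not a real mortality table), but the pattern holds for any distribution: condition on reaching 65, and the expected age at death jumps.

```python
import random

random.seed(42)

# Hypothetical lifespans: a bell curve around 74 with SD 16, floored at 0.
# (A toy distribution for illustration, not fitted to real mortality data.)
lifespans = [max(0.0, random.gauss(74, 16)) for _ in range(100_000)]

overall = sum(lifespans) / len(lifespans)
reached_65 = [x for x in lifespans if x >= 65]
conditional = sum(reached_65) / len(reached_65)

print(f"mean age at death, everyone:             {overall:.1f}")
print(f"mean age at death, those who reached 65: {conditional:.1f}")
# The second mean is always higher: people who have already had 65 birthdays
# have, by definition, dodged every hazard of the first 65 years.
```

This selection effect is exactly why the two cohorts have to be compared at the same starting age.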

So, I pulled out those who were 65 when the study started and looked at the survival curve.

[Survival curve for patients aged 65: Cohort 1 (1971) and Cohort 2 (1980)]

A t-test of years lived for Cohort 1 versus Cohort 2, using only the 290 subjects who were 65 years old at the start of the study, produced a very non-significant t-value of .56 (p > .50).

T-tests for subjects at age 75 and age 85 produced similar results. So, based on these data, the answer to the question of whether Kaiser Permanente patients, at least, increased in life expectancy over the 1970s is “No.” This isn’t a comment on Kaiser Permanente one way or the other, merely an observation that it is unlikely that their patients are completely representative of the population.

Just an aside: a million points to people who put their data on the web, open to all comers. This shows two traits I admire. The first is generosity – allowing someone else to benefit from your efforts in collecting the data, with no expectation of return. The second is courage. It takes a good amount of courage to publish your results and then make your data available so anyone who wants can re-analyze the data and perhaps come up with a competing conclusion. So, props to you.

P.S. You can buy the hand soaps on etsy. I have no affiliation with them, I just thought they were funny.

# Can you say Caveat Emptor if the data are free?

Let the buyer beware – that phrase certainly applies to open data, as does the less historical but equally true statement that students always want to work with real data until they get some.

Lately, I have had students working with two different data sets that have led me to the inevitable conclusion that self-report data is another way of proving that people are big fat liars. One data set is on campus crime and, according to the data provided by personnel at college campuses, the modal number of crimes committed per year – rape, burglary, vehicle theft, intimidation, assault, vandalism – is zero. Having taught at a very wide variety of campuses, from tribal colleges to extremely expensive private universities, and never seen one that was 100% crime free, I suspected  – and this is a technical term here – that these data were complete bullshit. I looked up a couple of campuses that were in high crime areas and where I knew faculty or staff members who verified crime had occurred and been reported to the campus and city police. These institutions conveniently had not reported their data, which is morally preferable to lying through their teeth, with the added benefit of requiring less effort.

Jumping from there to a second study on school bullying, we found, as reported by school administrators in a national sample of thousands of public elementary, middle and secondary schools, that bullying and sexual harassment never, or almost never, occur and there are no schools in the country where these occur on a daily basis. Are you fucking kidding me? Have you never walked down the hall at a middle school or high school? Have you never attended a school? What the administrators thought to gain or avoid by lying on these surveys, I don’t know, but it certainly reduces the value of the data for, well, for anything.

So … the students learn a valuable life lesson about not trusting their data too much. In fact, this may be the most valuable lesson they learn: Stamp’s Law.

The government are very keen on amassing statistics. They collect them, add them, raise them to the nth power, take the cube root and prepare wonderful diagrams. But you must never forget that every one of these figures comes in the first instance from the village watchman, who just puts down what he damn pleases.

From an analysis standpoint, this is the soapbox I am ranting on every day: before you do anything, do a reality check. If you use SAS Enterprise Guide, the Characterize Data task is good for this, but any statistical software – or any programming language, for that matter – will have options for descriptive statistics: means, frequencies, standard deviations.
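A reality check doesn’t need to be elaborate. Here is the kind of thing I mean, sketched in plain Python with invented campus-crime records (in practice you would run this, or Characterize Data, on the real file):

```python
from collections import Counter

# Invented example records: (campus, crimes reported per year).
# In practice these would come from the actual data file.
records = [("A", 0), ("B", 0), ("C", 0), ("D", 0), ("E", 12), ("F", 3), ("G", 0)]

values = [v for _, v in records]
print("n:      ", len(values))
print("min/max:", min(values), max(values))
print("mean:   ", round(sum(values) / len(values), 2))
print("freqs:  ", Counter(values))
# A modal value of zero crimes, across campuses of every size and location,
# is exactly the kind of too-good-to-be-true pattern to distrust.
```

Five minutes spent looking at minimums, maximums and frequencies like these will catch the village watchman putting down what he damn pleases.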

This isn’t to say that all open data sucks. On the contrary, I’m working with two other data sets at the moment that are brilliant. One combined nine years of abstracts of medical records with state vital records to capture medical care, diagnoses and death for patients over 65. Since Medicare doesn’t pay unless you have data on the care provided and the diagnosis, and the state is kind of insistent on recording deaths, these data are beautifully complete and probably pretty accurate.

I’ve also been working with the TIMSS data. You can argue all you want about teaching to the test, but it’s not debatable that the TIMSS data are pretty complete and pass the reality test with flying colors. Distributions by race, gender, income and every other variable you can imagine are pretty much what you would expect based on U.S. Census data.

So, I am not saying open data = bad. What I am saying is let the buyer beware, and run some statistics to verify the data quality before you start believing your numbers too much.

# Open Data, SAS On-Demand & African-American Women

October 25, 2011 | 3 Comments

Let me just say off the bat that open data is awesome and there should be more of it available. This semester, I have been using SAS On-Demand in my statistics class and creating the data sets to meet students’ interests. Despite aspersions I read on Twitter that some statisticians know no more than which PROC to use to get a p-value, creating those data sets is, unfortunately, not all that easy.

I was going to write about adjusted survival curves and log-log curves with PHREG tonight, but it is already past 1 a.m., and both my time and my Chardonnay have been exhausted creating analytic data sets for my students.

I did hear back from the helpful folks at the National Center for Educational Statistics (thank you!) and downloaded the School Survey on Crime and Safety for a group of students interested in bullying. Awesome public use data set. Check it out!

After that, I had another group of students interested in testing the hypothesis that African-American women are less likely to get married the more education they have. Conveniently, I had the American Community Survey data for California on my desktop from some analyses I had done earlier, so I pulled out the subset of people they were interested in: native-born African-American women over 15 years of age. (Actually, the picture is my cousin, who has never, as far as I know, been to America, but hey, Ashelle, if you’re reading this, come and visit. It’s nice here.)

I downloaded the data, created a few new variables to fit the students’ interest, and emailed them the file and documentation. For example, they wanted to break education down into categories, thinking, rightly, I believe, that getting a high school diploma or a college diploma is a better way of categorizing education than counting years, since education does not have a linear relationship with most other variables.

I did run some of the analyses myself because I was curious, and all I will say is that the preliminary results are very, very interesting. I am looking forward to their presentation.

So, that is the plus of open data  – real data, real experience and questions the students really want to answer.

The minus – well, it took me a lot of time to locate and download the data. The data set for the study on African-American women I already had on my computer, but the one on school crime I had to track down, and it still wasn’t exactly what they had originally planned – although it ended up working perfectly.

A second minus is that SAS On-Demand is SLOW. It is several times better than it was originally – when first released, it was so slow as to be useless. Now, based on I don’t know what – sunspots – there are times it works perfectly, just a tiny bit slower than SAS on my desktop, and other times when it is really tedious. I’m sticking with it this semester because it is a) free, b) used in lots of organizations where my students may work one day, and c) showing the potential to be really useful.

A third minus is that one of the students has not been able to get it to install and run, for reasons I cannot figure out. I referred him to SAS Tech support today.

If the professor teaching a statistics, research methods or data mining course does not have a lot of SAS programming experience, I think using SAS On-Demand would be a challenge.

So — why bother? I think it comes back to the studies: one group is studying African-American women and marriage, another bullying in school, and a third the relationship between arts education and academic achievement, using the National Educational Longitudinal Study.

Years ago, when my daughter, Maria Burns Ortiz, was a little girl, I asked her how science class was at her new school. We had recently moved and she had gone from a magnet school for gifted children to a regular parochial school. She said, “We don’t have science.”

I corrected her, “Mija, you must have science. You got an A in it on your progress report.”

She said, “No, we don’t have science at this school. We just read about it.”

So, that is why I am putting together data sets at 1 a.m. My students have statistics, they don’t just read about it.

# Why I Don’t Have Minions

September 20, 2011 | 1 Comment

Admit it, more than once you have thought to yourself,

Wouldn’t it be convenient about now to have some mindless minions to do my bidding?

I’d always thought that if this whole statistical consulting thing didn’t work out, I could be an evil scientist. I mean, I already went to the trouble of getting a Ph.D., so I have half of the Dr. Evil thing down – the Dr. part. You’d think that would be the more difficult part, no?

According to Merriam-Webster, a minion is a servile dependent, follower, or underling. I’ve often been asked why I don’t have someone else write programs to delete all out-of-range data, reverse-code answers where needed, re-code answers such as “not applicable” to missing, check that the distributions meet assumptions of normality, and so on.

I think that is a very bad idea.

The first reason having someone else do the “menial work” of data cleaning is a bad idea is that all of your analysis is going to rest on the assumption of that data cleaning having been done correctly. In cases where it is my reputation (or other parts of me) on the line, I want to be sure that data was coded and scored correctly. If not, everything after that was a waste of time. It’s not uncommon to find differences in conclusions from different studies. This can be due to a variety of reasons – differences in age, education, health, occupation and a thousand other factors in your sample, differences in type of measure – GPA versus standardized test versus high school graduation and on and on. I wonder, though, if sometimes those differences are not due to some of the studies’ data just being entered, coded and scored incorrectly.

The second reason for getting down and dirty with your own data is that you often start to have ideas as you play with the variables. What is the relationship between mother’s education and school failure? What about father’s education? What about taking the maximum (or minimum) of the two? After all, there are programs for “first-generation college students” – that is, students neither of whose parents went to college. As you play with your data, you can see new relationships and develop new questions.
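That kind of playing is cheap to do in any language. Here is a plain-Python sketch, with invented records and an invented coding (education in years, None for missing), of deriving a “highest parental education” variable and a rough first-generation flag:

```python
# Invented rows: (mother_ed, father_ed) in years of schooling; None = missing.
parents = [(12, 16), (10, None), (None, None), (14, 12)]

def max_parent_ed(mom, dad):
    """Highest education of either parent, ignoring missing values."""
    known = [e for e in (mom, dad) if e is not None]
    return max(known) if known else None

for mom, dad in parents:
    hi = max_parent_ed(mom, dad)
    # A rough flag: no parent with any schooling past year 12 (no college).
    first_gen = hi is not None and hi <= 12
    print(mom, dad, "-> max:", hi, "first-generation:", first_gen)
```

Deciding how to handle the missing values, and whether the rough flag is the right coding, is exactly the kind of judgment call I would not hand off to a minion.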

Personally, if I’m going to delegate tasks, I prefer to do it at the end, having someone else make a PowerPoint presentation or graphs of the data, maybe doing some of the final analyses. If any “minions” are doing this work after I have spent considerable time working with the data set, then I should be able to spot any mistakes and say, “You know, I don’t think your finding that mother’s education is unrelated to school failure fits with what I found in other analyses; let’s take a look at how you arrived at that.”

Besides, just look at what happens when evil scientists DO use minions. Think how Young Frankenstein would have come out differently if the doctor had gone and gotten the brain himself and assigned Igor a task of say, picking out the monster’s wardrobe instead.

# Census in Black & White: What I wondered about lately

August 22, 2011 | 2 Comments

The census now allows more than one race to be checked. For many years, friends of mine in inter-racial couples would check the “Other” box for race when they registered their children for school, rather than pick black or white.

Although an individual’s census form responses are confidential, you certainly are free to tell anyone what you put. In response to an inquiry, the White House spokesperson said that President Obama had checked only “African-American or Black”, even though his mother is white.

Now that you can select both black and white as race, I wondered how many people did. Unlike normal people who wonder about these things, I decided to download all 3,030,728 records from the 2009 American Community Survey to find out. Once I downloaded the survey and read it into SAS, I produced the chart below. I was quite surprised to see how few people checked both black and white. As you can see, it was less than 1%.

The SAS code to create this chart is shown below. You might think this is a ridiculous amount of work to create one chart and that you could do it far more easily in Excel. You’d be correct, except for two things. One, in earlier versions of Excel there was no way you could read in 3,000,000+ records. Even if you can do it now, I’ll bet it’s painfully slow. Two, most of these options or steps only need to be done once, and I was doing multiple charts. The AXIS and PATTERN statements only need to be specified once.

If you DO want to create your chart in Excel, you could just do the first part, the PROC FREQ, then export your output from the frequency procedure to a four-record file and do the rest in Excel. There is no need to get religiously attached to doing everything with one program or package.
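If you do roll your own, the weighted cross-tabulation itself is simple enough. Here is a Python sketch of what the PROC FREQ with a WEIGHT statement computes; the record layout (one tuple per person) and the function name are my own assumptions.

```python
from collections import Counter

def weighted_race_percents(records):
    """records: iterable of (racblk, racwht, person_weight) tuples.
    Returns the percent of the weighted population falling in each of
    the four black/white combinations -- the same numbers PROC FREQ
    produces when given a WEIGHT statement."""
    labels = {(1, 0): "Black", (0, 1): "White",
              (1, 1): "Mixed", (0, 0): "Other"}
    totals = Counter()
    for blk, wht, weight in records:
        totals[labels[(blk, wht)]] += weight
    grand_total = sum(totals.values())
    return {race: 100 * w / grand_total for race, w in totals.items()}
```

The person weight (pwgtp in the ACS) matters: each record stands in for a different number of people, so unweighted counts would misstate the percentages.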

```
PROC FREQ DATA = lib.pums9 NOPRINT ;
   TABLES racblk * racwht / OUT = lib.blkwhitmix ;
   WEIGHT pwgtp ;
RUN ;

DATA byrace ;
   SET lib.blkwhitmix ;
   IF racblk = 1 AND racwht = 0 THEN Race = "Black" ;
   ELSE IF racblk = 0 AND racwht = 1 THEN Race = "White" ;
   ELSE IF racblk = 0 AND racwht = 0 THEN Race = "Other" ;
   ELSE IF racblk = 1 AND racwht = 1 THEN Race = "Mixed" ;
   percent = percent / 100 ;
RUN ;

AXIS1 LABEL = (ANGLE = 90 "Percent") ORDER = (0 TO 1 BY .1) ;
AXIS2 ORDER = ("White" "Black" "Mixed" "Other") ;
PATTERN1 COLOR = BLACK ;
PATTERN2 COLOR = GRAY ;
PATTERN3 COLOR = BROWN ;
PATTERN4 COLOR = WHITE ;

PROC GCHART DATA = byrace ;
   VBAR Race / RAXIS = axis1 MAXIS = axis2 SUMVAR = percent
               TYPE = SUM OUTSIDE = SUM PATTERNID = MIDPOINT ;
   LABEL Race = "Race" ;
   FORMAT percent percent8.1 ;
RUN ;
```

# Wilcoxon, Normality, Paired T-test & Smart Boys

May 22, 2011 | 2 Comments

Lately, I’ve been missing some of my former colleagues at the USC Medical School. This is not just because they are super-nice people, which they are, but also because they used to ask for different types of statistics, and I do think variety is the spice of life – except for in marital relationships where it is the spice of divorce courts.

Many of the physicians I’ve worked with deal with small sample sizes, especially if they are just looking at their own practices. Not wanting to violate any confidentiality agreements here, let’s make up a disease, say, fear of naked mole rats, or nakedmoleratophobia. In the normal course of one’s practice, you may only see a couple dozen people a year who have this malady.

Thus, many of the medical studies on which I have been a consultant involve small sample statistics. I haven’t done a lot of that lately, so I found myself coveting a Mann-Whitney U (used in place of an independent t-test) or a Wilcoxon signed rank test.

I ran through what data I had that could serve as a small sample and answer a question that interested me, and here is what I have been thinking about —

I’ve heard most of my life, from most of the experts, that allowing gifted children to skip grades and attend school with older children is a bad idea. Short version – my brother and I both started college at 16. I thought it was a terrific advantage, and two of my daughters began college before age 18. My brother thought it was a bad idea, and neither of his children began college early.

I got to wondering specifically about males who were accelerated, since you hear that boys mature later. Another common belief is that boys are less verbal. While I was wondering, I noticed in the TIMSS data that there were 31 males who were younger than the typical age range – that is younger than 13.5 at the time of the test.

I wondered if, given the bias against promoting children, and especially boys, whether these young boys would be exceptionally advanced. I also wondered if they would be doing relatively better in mathematics than in science since, based on my completely casual observations, it seemed like middle school science requires a lot more reading than middle school mathematics does.

Both mathematics and science on the TIMSS are measured on a scale with a mean of 500, so I thought I could compare these using a Wilcoxon signed rank test. In case you didn’t know, this is a non-parametric test used with small sample sizes with related measures. Kind of like a paired t-test for non-normal data.

There are all sorts of statistical packages you could use to do this, and with small sample sizes like the one I have, you can even do it by hand. I happened to use SAS. I was going to try it with SPSS also but that would have required moving at least four feet to the computer and desk behind me. (Yes, my office does have two desks and two computers. What of it?)

It’s quite simple, really. You create a difference score by subtracting one variable from the other and then do a PROC UNIVARIATE. (This page from the University of Delaware gives a few other ways to do it. It also has a picture of a turkey’s head, which is something you don’t see that often. You also don’t hear much from Delaware. They are awfully quiet there. They are probably up to something.)
```
data smartboys ;
   set sm ;
   where itsex = 2 and round(bsdage) = 13 ;
   diff = BSMMAT01 - BSSSci01 ;
run ;

proc univariate data = smartboys ;
   var diff ;
run ;
```

This gives me the following:

```
Tests for Location: Mu0=0

Test           -Statistic-    -----p Value------

Student's t    t  -0.07961    Pr > |t|    0.9371
Sign           M      -1.5    Pr >= |M|   0.7201
Signed Rank    S       -24    Pr >= |S|   0.6458
```

Plus a bunch of other stuff.
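If you are curious what that Signed Rank statistic S actually is, it can be computed by hand: rank the absolute values of the nonzero differences (using midranks for ties), sum the ranks of the positive differences, and subtract n(n+1)/4, so S is zero when positives and negatives balance out. Here is a minimal Python sketch of that calculation – my own illustration of the formula, not the SAS internals.

```python
def signed_rank_S(diffs):
    """Signed rank statistic as SAS reports it: sum of the ranks of the
    positive differences minus n(n+1)/4, where the nonzero differences
    are ranked by absolute value (midranks for ties)."""
    nonzero = [d for d in diffs if d != 0]   # zero differences are dropped
    n = len(nonzero)
    abs_sorted = sorted(abs(d) for d in nonzero)

    def midrank(value):
        # average rank of all observations tied at this absolute value
        first = abs_sorted.index(value) + 1   # 1-based rank of first tie
        return first + (abs_sorted.count(value) - 1) / 2

    positive_rank_sum = sum(midrank(abs(d)) for d in nonzero if d > 0)
    return positive_rank_sum - n * (n + 1) / 4
```

For example, signed_rank_S([1, -2, 3, -4, 5]) gives 1.5: the positive differences hold ranks 1, 3 and 5 (sum 9), and 5(6)/4 = 7.5 comes off.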

Well, clearly, there is no significant difference between their scores in mathematics and science. This isn’t very surprising when you learn that their average score in mathematics is 535.7 and in science 537.2. So, it is a really small difference and not at all what I expected. Also, from looking at the PROC UNIVARIATE output for the mathematics and science scores, it was obvious that the distributions were quite normal and I could have gone ahead and used a paired t-test. The t statistic, shown above and helpfully included as part of the univariate output, shows the difference is even less significant.

HOWEVER — and here is where it is useful and highly recommended to know something about your data – it turns out that the mean scores for the U.S. are anything BUT identical. In fact, the mean for U.S. students in mathematics is 508 with a standard deviation around 77, while the mean for science is about 520 with a standard deviation around 84. So, these young boys are about .37 standard deviations above their peers in mathematics and about .23 standard deviations higher in science. In fact, when I compared them to the other students, these boys WERE significantly higher than their peers in mathematics but not in science.
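Just to check the arithmetic, the standardized differences can be computed directly from the figures above. A quick Python sketch; with these rounded means and SDs the results come out a shade lower than the .37 and .23 quoted, presumably because the original calculation used unrounded values.

```python
# Standardized difference = (group mean - national mean) / national SD,
# using the rounded figures quoted in the text
math_effect = (535.7 - 508) / 77      # roughly 0.36 SDs above the U.S. mean
science_effect = (537.2 - 520) / 84   # roughly 0.20 SDs above the U.S. mean
```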

```
data testboys ;
   set lib.statsfile ;
   if itsex = 2 and round(bsdage) = 13 then smb = 1 ;
   else smb = 0 ;
run ;

proc ttest data = testboys ;
   class smb ;
   var bsmmat01 bsssci01 ;
run ;
```

I had thought, given that there seems to be a prejudice against starting school early or skipping grades, both in general and especially for boys, that these boys would have to be amazingly ahead of their peers. As you can see, that isn’t the case. Yes, they were ahead, and yes, in mathematics it was statistically significant, but they weren’t far out there on the right of the normal curve.

On the other hand, most of them were doing quite fine, thank you, and being youngest in their classes didn’t seem to be affecting them in any negative way, at least, not academically.

Of course, since it did turn out that the data were quite normal, I could have just simply done a paired t-test, as so:

```
proc ttest data = smartboys ;
   paired BSMMAT01 * BSSSci01 ;
run ;
```

Of course, this will give me the EXACT same result as the t-test in the univariate output above, with one less step, because I don’t need a data step to create a variable that is the difference between the two.
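That equivalence is worth seeing directly: a paired t-test is nothing more than a one-sample t-test on the pairwise differences. A small Python sketch using only the standard library, so the formula stays visible (function names are my own):

```python
import math

def one_sample_t(xs, mu0=0.0):
    """t statistic for H0: mean(xs) = mu0, with n - 1 degrees of freedom."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)   # sample variance
    return (mean - mu0) / math.sqrt(var / n)

def paired_t(xs, ys):
    """A paired t-test is just a one-sample t-test on the differences."""
    return one_sample_t([x - y for x, y in zip(xs, ys)])
```

For example, paired_t([1, 2, 3, 4], [0, 1, 1, 2]) works out to about 5.196 – identical to running one_sample_t on the differences [1, 1, 2, 2], which is exactly why PROC UNIVARIATE on diff and PROC TTEST with a PAIRED statement agree.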

However, I got to do my Wilcoxon signed rank test and I got an answer to my questions; in fact, for the question of math vs. science, I got two answers, and they both agreed. On top of it all, the world’s most spoiled 13-year-old received a letter today telling her that she was accepted for the Summer Scholars program, despite not being 12, or a boy (which, since it is a program for high-achieving girls, actually worked in her favor).

So, I am satisfied and fulfilled. It’s just another sunny day in paradise.

# Data Quality Macro Explained

May 13, 2011 | 1 Comment

Yesterday, I talked about using a macro for beginning checks on data quality and I promised to explain it today. So, here we go…
```
options mstored sasmstore = maclib ;
```

If you want to store your macro, you need to use two options in the OPTIONS statement: MSTORED says that you want stored compiled macros, and SASMSTORE = specifies the library where they will be kept.

Why you need two options, I don’t know.

`LIBNAME MACLIB "C:\Users\MyDir\Documents\My SAS Files" ;`

*** That’s the directory where my macro will be stored.

```
%macro dataqual(dsn, idvar, startvar, endvar, obsnum) / store ;
```

*** This creates a macro named dataqual that will take parameters for

dsn = data set name

idvar = the subject identifier in the data set. This is something like social security number, telephone number (if you’re a phone company), customer number or other variable that should NOT have duplicates.

startvar = this is the first variable in the data set that I want to get descriptive statistics on

endvar = this is the last variable I want descriptive statistics on

obsnum = the number of observations in the data set

The / store is an option that tells SAS to store this macro in the library specified by the SASMSTORE = option and matched in the LIBNAME statement.

```
Title "Duplicate ID Numbers" ;
Proc freq data = lib.&dsn noprint ;
   tables &idvar / out = &dsn._freq (where = (count > 1)) ;
   format &idvar ;
run ;
```

***** This will create a frequency distribution but not print it (that’s the NOPRINT option on the PROC FREQ statement). This is important to remember because many of my data sets will have hundreds of thousands or millions of records, each with a unique identifier. It will output the duplicate values to a data set named &dsn._freq, with, of course, &dsn replaced by whatever name I give it.

Since many of the public data sets I use have formats for every value, and I don’t want the formatted value used, the statement

`FORMAT &idvar ;`

will cause it to use the unformatted value for &idvar .

```
proc print data = &dsn._freq (obs = 10) ;
run ;
```

*** This prints the first ten duplicate values. You want to be careful to put in that (obs = 10) just in case something went wrong in the FREQ procedure and it ends up outputting everybody. For example, if you accidentally put (count = 1), you may get 2 million records in your &dsn._freq data set, and it would not be very good to print all of those out.
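Incidentally, the same duplicate check is a few lines in most languages. Here is a Python sketch of what the PROC FREQ with WHERE = (count > 1) produces (the function name is my own):

```python
from collections import Counter

def duplicate_ids(ids):
    """Return {id: count} for every ID that appears more than once --
    the analogue of outputting PROC FREQ counts where count > 1."""
    return {i: c for i, c in Counter(ids).items() if c > 1}
```

On a clean file of unique identifiers this returns an empty dict, which is exactly what you hope to see.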

(Skipping story of dumping reams of green and white lined computer paper. If you’re my age, you have one of those stories of your own.)

```
proc summary data = lib.&dsn mean min n std ;
   var &startvar -- &endvar ;
   output out = &dsn._stats ;
run ;
```

**** This is going to create a data set, &dsn._stats, with the mean, minimum, n and standard deviation for each variable in your data set from &startvar to &endvar.

```
proc transpose data = &dsn._stats out = &dsn._stats_trans ;
   id _STAT_ ;
run ;
```

This is going to transpose your data set so that instead of five records with 200 or 500 or however many variables you have, you have 500 records with variables for _NAME_, _LABEL_, minimum, mean, n and standard deviation.
```
data &dsn._chk ;
   set &dsn._stats_trans ;
   pctmiss = 1 - (n / &obsnum) ;
   if min < 0 then neg_min = 1 ;
   else neg_min = 0 ;
   if std = 0 then constant = 1 ;
   else constant = 0 ;
   if (pctmiss > .05 or neg_min = 1 or constant = 1) then output ;
run ;

Title "Deviant variables to check" ;
proc print data = &dsn._chk ;
run ;
```

**** This reads in the transposed data set and then does some quality checks – if the standard deviation is 0, more than 5% of the data are missing, or there is a negative minimum value, the variable is output. Then I get a listing of variables that in some way warrant a second look. The columns show the descriptive statistics, plus new variables that show the percent missing and whether the variable is a constant or has a negative minimum.
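For comparison, here is the same set of three checks sketched in Python, with a made-up layout (variable name mapped to its n, mean, minimum and standard deviation) standing in for the transposed SAS data set. The function name and thresholds match the macro's logic, but the data structure is my own assumption.

```python
def flag_variables(stats, n_obs):
    """stats maps variable name -> (n, mean, minimum, std).
    Applies the macro's three checks: more than 5% missing values,
    a negative minimum, or zero standard deviation (a constant)."""
    flagged = {}
    for name, (n, mean, minimum, std) in stats.items():
        problems = []
        if 1 - n / n_obs > 0.05:       # pctmiss > .05
            problems.append("missing")
        if minimum < 0:                # neg_min = 1
            problems.append("negative_min")
        if std == 0:                   # constant = 1
            problems.append("constant")
        if problems:
            flagged[name] = problems
    return flagged
```

A clean variable produces no entry at all, so the returned dict is exactly the short "second look" list the macro prints.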

```
Title "First 10 observations with ALL of the variables unformatted" ;
proc print data = lib.&dsn (obs = 10) ;
   format _all_ ;
run ;
```

****  You should always stare at your data. This prints out the first 10 values without formats, just so I can see what it looks like. I use a lot of public data sets that come with user-defined formats. The FORMAT _ALL_  statement removes all formats for this step and will print just the unformatted values.
```
Title "First 10 observations with ALL of the variables formatted" ;
proc print data = lib.&dsn (obs = 10) ;
run ;
```

*** This prints the first 10 observations with formatted values, obviously. Using formatted values is the default so I didn’t need a FORMAT statement.

```
Title "Deviant variables to drop" ;
proc print data = &dsn._chk noobs ;
   var _name_ ;
run ;
```

****  This prints the names of all of those variables that have problems. I can copy this from the output window, type the word

DROP

paste the list of variables, add a semi-colon and I have my drop statement.

```
%put You defined the following macro variables ;
%put _user_ ;
```

**** This just helps with trouble-shooting. It will put the user defined macro variables to the log so it looks something like:

dsn :  studentgr8

idvar: IDSTUD

and so on.

```
run ;
%mend dataqual ;
```

*** and that is the end of my macro.

I hope you love it as much as I do. While it isn’t all fun, sexy, cool like doing a proportional hazards regression model, or even a factor analysis, if your data are no good, all of those fancy statistics are, at best, wrong and a waste of time and, at worst, wrong and a major blow to your career, depending on whose life your wrong data screwed up.
