In my copious spare time, of which I have none, I teach in the doctoral program at a nearby university.  They want me to use the library and keep up on research, both because it looks nice in the alumni newsletter and also so that when students ask me questions about current technologies or findings, I don’t shrug and say, “Your guess is as good as mine.”

That sort of thing makes students wonder whether getting a PhD is really worth going into debt for the remainder of their lives.

So, the university kindly pays money to a whole bunch of different publishers just so usually ungrateful people like me can engage in such use.

The authors of the research would also like people like me to use their research. They’d like to be cited, because that helps make them look good to funding agencies and tenure review committees. They’d like to think that their Uncle Bob was wrong, that they are not wasting their lives studying something as practical as how many angels can dance on the head of a pin, and that people who are actually teaching school or designing products will use their work to make the world a better place.

What’s the problem? The problem is that between the authors of the research, who probably did not get paid, and the university library, which paid for access, there are a number of barriers thrown up by publishers. Here is what happened yesterday:

  1. Log into my campus account
  2. Go to the library web page and search ejournals for the articles I need
  3. Find article, click link to go to year
  4. Click link to go to issue
  5. Click article
  6. Get taken to publisher page
  7. Get asked to log in with campus ID again
  8. Read half of  article – get called away for meeting
  9. Come back to find out I have been logged out due to being away from the computer. Go through steps 1-7 again.
  10. After answering a couple of calls from clients and students, find I have been logged out due to inactivity. Go through steps 1-7 again.
  11. Finish first article, go to second article, which is published by different publisher
  12. Find out that even though I have a university ID that I have now logged in with twice (not counting the two previous times I was logged out), I need to register for an account with this publisher and log into that
  13. Register, log in, read 20 pages. Eat lunch. Come back to find I have been logged out and now need to log into the campus account, go to the library web page, go back to the article and log into the publisher account AGAIN

There was more, but you get the idea. If I can, I download the resource to my computer, but often the number of pages I can download is limited. The crazy thing is that all of this is required of someone who has paid access to the articles.

DIRECTORY OF OPEN ACCESS JOURNALS TO THE RESCUE !

I recalled reading a draft of an article my brother had written and when I told him I’d rather blog because then people could at least read it, he mentioned he was publishing it in an open access journal.

My first stop was the Directory of Open Access Journals and now I am in love.

It was amazing. First of all, they had one journal that had lots of articles that were exactly what I wanted. The Journal of Research in Rural Education, if you are wondering. I got the wild impression from this journal’s website that they actually wanted me to be able to read the articles. Here is the unbelievably crazy thing that happened. I was able to search on the terms I was interested in, 150 results were returned, and when I clicked on a link — IT OPENED WITH THE ARTICLE.

When I went to eat dinner and came back, an amazing feat of technological innovation had occurred – THE ARTICLE I HAD BEEN READING WAS STILL THERE! Apparently, unlike the other publishers, the folks at JRRE are not concerned that a band of roaming article sneak thieves prowling the rough neighborhood of Ocean Park will break into my office while I am having my jambalaya and read research on mathematics education in rural contexts without paying $9 per article.

The effect these overly zealous firewalls have had on me personally is a definite preference for anything that is open access. I did request a few articles via interlibrary loan and I’ll probably order a book or two. Given that the university is less than 10 miles away, it’s faster for me to pick up the material in bulk in print than to read it online – which is just nuts!

There are a few articles and books I wanted because they were very specifically related to the work I am doing. However, for 90% of it, one article on fidelity of implementation measures is as good as another. So, the result will be that the work in the open access journals will get used and cited and the rest will not. I suspect we are seeing the beginning of a trend here.

 

(If you are too young to remember the song, “Money for nothing and your chicks for free”, I guess the title of this post is not nearly as amusing to you as it was to me, as my lovely daughter so unhelpfully pointed out.)

Since I whined yesterday about Codecademy not providing much explanation of the code in the Quickstart course (where not much is defined as none at all), I thought I should not be so hypocritical.

I am often posting some code and saying I will explain it later. I noticed some of those from 2009 or 2010. Well, 2012 is definitely later. I think I’ll work backward though and start with what I promised to explain “later” this week.

As I’ve rambled on here a lot, open data is a great idea, but it takes some work. I decided to post some of what I’ve been doing here with explanation. In part, this was motivated by a talk I had today with some researchers I’ll be working with over the next few months. Someone said, quite correctly,

“You read the journal article and they say  they did propensity score matching but they never tell you exactly how they did it, how they modified the code, which macro they used, because that’s not really the focus of the article. Unfortunately, when you go to replicate that model, you can’t because there isn’t enough detail.”

So, here in great detail from beginning to end is how I banged the data into shape for the analyses of the Kaiser-Permanente Study of the Oldest Old. These are data I downloaded from the Inter-university Consortium for Political and Social Research. (Free, open data).

The analysis was all done using SAS On-Demand for Academics.

LIBNAME mydata "/courses/u_2.edu1/i_123/c_2469/saslib" ;

The LIBNAME statement specifies the directory for my course on the SAS server. This is going to be unique to your course. If you have a SAS On-Demand account and you are the instructor, you know what this directory is. The “mydata” is a libref – that is, it is just used in the program to refer to the library a.k.a. directory. You can use any valid SAS name of up to eight characters. Actually, I am not using anything in the class directory in this example, but I put it as the first line in every program as a habit so when I DO need the data in the directory available, I have it.


OPTIONS NOFMTERR ;

This prevents the program from stopping when it cannot find the specified format. SAS Enterprise Guide is generally pretty forgiving about format errors, and I used a .stc file which should have the formats included, but I usually include this option anyway. If there are missing formats, you’ll get notes in your SAS log but your program will still run.


FILENAME readit "/courses/u_2.edu1/i_123/c_2469/saslib/04219-0001-data.stc" ;
PROC CIMPORT INFILE = readit LIBRARY = work ISFILEUTF8=TRUE ;

This imports the file, complete with formats. I rambled on about this in an earlier post. In short, because this particular file was created on a different system, you can EITHER have the formats OR have it in your permanent (i.e. course) directory, but not both. Click on the previous link if you need detail on .stc files or CIMPORT. Otherwise, move on.


PROC FORMAT ;
VALUE alc
3 = "1-2"
2 = "3-5"
1 = "6+" ;

This creates a format for one of the variables in the data set. As you can see, the variable was coded 1 = 6 or more drinks a day, 2 = 3-5 drinks a day, 3 = 1-2 drinks a day. I want to apply a format so the actual ranges print out, instead of “1” for the people who had six or more drinks.
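If you want to see what a format like this does without touching the real data, here is a minimal sketch. Everything in it is made up for illustration: the format name alcdemo, the data set work.fmtdemo and the eight values. It also includes a label for 0, on the assumption that non-drinkers are coded zero, which is not part of the format above.

```sas
/* Hypothetical stand-alone demo; not part of the program in this post */
PROC FORMAT ;
  VALUE alcdemo
    0 = "None"      /* assumed code for non-drinkers */
    1 = "6+"
    2 = "3-5"
    3 = "1-2" ;
RUN ;

/* Eight made-up responses */
DATA work.fmtdemo ;
  INPUT amntalc @@ ;
  DATALINES ;
0 1 2 3 3 2 0 3
;
RUN ;

/* The table rows print as None, 6+, 3-5, 1-2 instead of 0, 1, 2, 3 */
PROC FREQ DATA = work.fmtdemo ;
  TABLES amntalc ;
  FORMAT amntalc alcdemo. ;
RUN ;
```

The stored values stay 0 through 3. Only the printed labels change, which is why a missing format costs you readability but not data.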


DATA work.alcohol ;
SET work.DA4219P1 ;

The previous CIMPORT step created the data set DA4219P1. I am reading it into a new data set, named alcohol, where I will recode and create the variables for my final data set to analyze. Everything I am doing in this program COULD be done with tasks in SAS Enterprise Guide, but I found it more efficient to do it this way. Both data sets are in the temporary WORK library.

ATTRIB b LENGTH = $10 c1dthyr LENGTH = 8 c2dthyr LENGTH = 8 ;

In the ATTRIB statement, I am defining new variables and specifying the length and type. You don’t have to do this in a SAS program but if you don’t, SAS will assign length and type when it first encounters the variables, and it may not do it exactly the way you want.
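Here is a small sketch of what "not exactly the way you want" can look like. It is a stand-alone toy with a hypothetical variable x, not part of the program in this post. A character variable gets its length from the first place SAS sees it, so a longer value assigned later is cut off.

```sas
/* Without a declared length: x is created with length 2 */
DATA work.nolength ;
  x = "ab" ;
  x = "abcdef" ;     /* truncated to "ab" */
  PUT "No ATTRIB:   " x= ;
RUN ;

/* With ATTRIB: x is created with length 10 before any value is assigned */
DATA work.withlength ;
  ATTRIB x LENGTH = $10 ;
  x = "ab" ;
  x = "abcdef" ;     /* kept in full */
  PUT "With ATTRIB: " x= ;
RUN ;
```

The PUT statements write the final value of x to the log, so you can compare the two steps side by side.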


IF alcopyr = 0 OR alcohol = 0 THEN amntalc = 0 ;
FORMAT amntalc alc. ;

The variable amntalc was missing for a large number of people, but most of those people had previously answered “no” to the questions asking if they drank alcohol in the previous year or ever in their life. If they said, “no” to either of those, I set the amntalc, which is how much they drink per day, to zero. This dramatically cut the amount of missing data. Then I applied the format.

death = INPUT(COMPRESS(dthdate),MMDDYY10. ) ;
/* MMDDYY10. expects month/day/year, so the month goes first and the day is fixed at 01 */
b = COMPRESS(brthmo) || "/01/" || COMPRESS(brthyr) ;
bd = COMPRESS(b) ;
brthdate = INPUT(b,MMDDYY10.) ;
lifespan = (death - brthdate) / 365.25 ;
lifespan = ROUND(lifespan,1) ;

 

All of the above calculates two variables I actually care about and a couple of others I’m going to drop. Death is the date of death. Lifespan is how old the person lived to be. First, I read in the date of death, which had been stored as a text field. That’s what the INPUT function does, and MMDDYY10. is the informat that tells it how to read the text. I stripped out the blanks first (that’s what the COMPRESS function does). Now I have the date of death.

The file doesn’t have actual birth days, just birth month and year. So, b combines the birth month and year with 01 for the day, assigning everyone a birthdate of the first day of the month in which they were born. The statement that creates brthdate reads that string as a SAS date. Then I subtract the birth date from the death date and divide by 365.25 to get how many years the person lived. Finally, I ROUND the result to the nearest year.

Yes, I could have combined a lot of these statements into one. For example, I could easily combine the last two statements calculating lifespan. I did it like this because I use SAS On-Demand to teach and if my students peek at the program, which many do, it is easier for them to understand broken down like this.
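A different way to get the same birthdate, sketched here on two made-up records, is the MDY function, which builds a SAS date straight from numeric month, day and year without going through a text string. This assumes brthmo and brthyr are numeric and that the year has four digits. Both are assumptions for the sake of the example, and the data set name work.datedemo is hypothetical.

```sas
/* Toy records: date of death as text, plus birth month and year */
DATA work.datedemo ;
  INPUT dthdate :$10. brthmo brthyr ;
  death    = INPUT(COMPRESS(dthdate), MMDDYY10.) ;    /* text to SAS date */
  brthdate = MDY(brthmo, 1, brthyr) ;                 /* first day of the birth month */
  lifespan = ROUND((death - brthdate) / 365.25, 1) ;  /* whole years lived */
  FORMAT death brthdate MMDDYY10. ;
  DATALINES ;
03/15/1985 7 1900
11/02/1990 1 1905
;
RUN ;

PROC PRINT DATA = work.datedemo ;
RUN ;
```

The FORMAT statement is only there so the dates print as dates rather than as a count of days since 1960.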


if alcopyr = 1 then alcopyr = . ;
if smcigars = 1 then smcigars = . ;
if educ in (3,4,5) then education = 4 ;
else education = educ ;

The above recodes some variables. For a couple, “1” meant the data were not available, so I changed that to missing. For education, I combined three categories so my data were the more typical categories of “less than high school”, “high school” etc.


IF cohort = 1 then do ;
IF death = . then c1dthyr = 12 ;
ELSE  c1dthyr = YEAR(death) - 1971 ;
yrslived = c1dthyr ;
dthy1 = YEAR(death) ;
end ;


else IF cohort = 2 then do;
IF death = . then c2dthyr = 12 ;
ELSE c2dthyr = YEAR(death) - 1980 ;
yrslived = c2dthyr ;
dthy2 = YEAR(death) ;
end ;


There were two cohorts, one for which data collection began in 1971 and one in 1980. I wanted the option to analyze two different variables (and possibly split the data set later), so I created two variables named c1dthyr and c2dthyr. I also wanted one variable because I wanted to be able to compare survival rates by cohort, so I created a variable named yrslived. I was working with a student who was interested in deaths in a specific range of years, so I created variables dthy1 and dthy2 for her. The YEAR function returns the year part from a SAS date. A DO group for each cohort took care of all of that.


DROP  b bd F_A_X01 -- I_AUTPSY MR1_CS1 -- A_F_DT1 D_A_DT2 -- H_DISDT4 MRCSDT1 -- MR3_DT4 IREVMVA -- NTIS5 LABsum1 -- E_MRDX Vis1y1 -- vis9y9  E_CR1 -- PRCS43  B_INTR -- MRCSDDT23 ;

The last statement drops a bunch of variables I don’t need. The somevar -- othervar notation, with the two dashes in between, includes all of the variables in the order they appear in the data set, from the first variable mentioned to the second. There were literally hundreds of variables I wanted to drop, so here they are all dropped. Now, I have the file I want and I am happy and ready to start running stuff for my first day of class.
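To make the two-dash list concrete, here is a tiny sketch with invented variable names. The point it tries to show is that the range goes by position in the data set, not by anything in the names themselves.

```sas
/* Variables are created in this order: id, a, junk_a, note, junk_z, b */
DATA work.wide ;
  id = 1 ; a = 10 ; junk_a = 0 ; note = "x" ; junk_z = 0 ; b = 20 ;
RUN ;

/* junk_a -- junk_z covers junk_a, note and junk_z, because note sits between them */
DATA work.narrow ;
  SET work.wide ;
  DROP junk_a -- junk_z ;
RUN ;

/* Only id, a and b should remain */
PROC CONTENTS DATA = work.narrow ;
RUN ;
```

This is different from a single dash, as in var1 - var5, which is a numbered range and only works when the names share a prefix and end in consecutive numbers.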

 

I am pissed.

As a small business owner, I am feeling very, very disappointed that there is certainly some law out there that would impose penalties if I drove on over to Riverside County and bitch-slapped Darrell Issa. I’ve grown cynical enough in my old age and after having run a small business since 1985 that I am used to every politician under the sun spouting “Think of small business!” as a knee-jerk reaction to anything, whether their position is for it or against it. Usually, they are easy enough to ignore. Payroll taxes are not going to decide whether or not I hire people – business demand is. Health care – we’ve made that an option for our employees long before it was required by law.

This time, though, they are REALLY pissing me off. Let me tell you what the Research Works Act is and how it really does hurt my small business. As this succinct article by Janet Stemwedel on the Scientific American blog site explains well, not only does it require the American taxpayer (that’s me!) to pay twice for the same research, but also, the very people being protected and profiting are NOT those who produce the work to begin with.

Right now, if a person is funded by federal funds, say, from the National Institutes of Health, they are required to submit the results of their research to the PubMed Central repository within twelve months of publication. The idea is that if the public paid for this research then the public has the right to read it. Sounds fair, right?

In case you don’t know, rarely do authors get paid by journals.  I’ve published articles in the American Journal on Mental Retardation, Research in Developmental Disabilities, Educational and Psychological Measurement – to name a few. I’ve been a peer reviewer for Family Relations. For none of this did I get paid. That was fine. Almost all of the research I did was funded by federal funds and part of the grant proposals included dissemination – that is, publication of scientific articles. Fair enough. As a peer reviewer, I’m just repaying the service others have done in reviewing my work. Again, no problem.

Yet, in many cases, if I need a journal article for a grant or report I’m writing for a client, it is going to cost me $30 per article.  Contract research is a good bit of where the actual money comes into this company (you didn’t seriously think I made my living by drinking Chardonnay, spouting wanton programming advice and snarky comments, did you?)

The journal did not pay for the research to be conducted – the federal government did. The journal did not pay the author – the federal government did. The journal did not pay the reviewer – they volunteer.

SO WHY THE FUCK SHOULD I PAY $30 AN ARTICLE?

I just pulled up a random small project I had done recently for a client and there were seven articles in there that would have cost me $210 to use. As I said, this was a small project, and I calculated it would have brought the price up about 7%.

Not long ago, there was a huge outcry about the city of Santa Monica adding a ten cent cost for a paper bag and banning plastic bags. “It will hurt small business!” people cried. Actually, I have always made a major effort to shop at our local businesses, I still do and it has not hurt any business anywhere that I can tell. You know what else I can tell you? That increasing the cost of a $3,000 project to $3,210 is a hell of a lot more significant than paying ten cents for a paper bag!

So what exactly is this bill doing? It is moving money out of the hands of small businesses, like mine and like that of my buddy, Dr. Jacob Flores, who runs Mobile Medicine Outreach, and into the hands of large publishing companies, who not coincidentally gave a huge amount of money to Democratic congresswoman Carolyn Maloney of New York. I’d like to bitch-slap her, too, but being on the opposite side of the country, it would be a lot less convenient for me.

You can read more detail in this article from the Atlantic, where Rebecca Rosen asks, “Why Is Open-Internet Champion Darrell Issa Supporting an Attack on Open Science? ”

As Danah Boyd points out on her blog, there is this new thing for sharing knowledge now, called the Internet, and a major point of the Research Works Act seems to be to prevent it from being used to share knowledge that I paid for with the approximately 50% of my income (yeah, 38% federal, 10% state) that I pay in taxes. And you know what, being a graduate from that great institution, the University of California, that enables me to make the money to pay these taxes, I don’t object to that.

What I DO object to is paying again for the same resources I already paid for once just because some lobbyists for large corporations lined Issa’s and Maloney’s pockets.

While it may not be legal for me to bitch-slap Issa it is certainly legal for me to go to the next California event where that lying-ass mother has the balls to stand up and claim to be helping California small business. Anyone who knows the next public event where he’ll be speaking, please hit me up.

One thing, though, I don’t think I’ll be going to any of his fundraisers. I think he’s gotten quite enough money from the publishing industry.

 

I’ve spent about 35 years messing around with computers based on the assumption that most discoveries are not preceded by “Eureka!” but rather,

“What the hell! May as well try it.”

Having had my new computer pretty much dissolve in smoke (less than a month after I bought it!), I decided to continue my analyses of the Kaiser-Permanente data using SAS On-Demand. I had downloaded the data as a .stc file a while back, and used PROC CIMPORT to read it in, with all the formats, which, as I have mentioned before, is really, really easy to do.

I wondered what would happen if I uploaded the SAS dataset, you know, the .sas7bdat file  and the FORMATS catalog to the SAS On-demand server. Would it work?

Short answer – kind of.  SAS On-Demand is very forgiving when it comes to format errors. It appears to have the NOFMTERR option turned on by default. I’d leave it alone.  Look what happens when I do a simple table analysis. I get a table – sort of.

This table doesn’t use the formats. No one reading it would know that 1 = “6+ drinks daily”. If it were just one format, I could use a PROC FORMAT and re-create it. No big deal. But there are hundreds of variables and thousands of lines of format code I’d have to write. My first thought of

“What the hell, I’ll just drag and drop the folder with the formats”

was a failure. In fact, even though Filezilla shows that my format folder was uploaded, it doesn’t actually show up on the SAS server at all.

Table on alcohol consumption with no formats

I suppose I could at this point go read some documentation or SAS user group papers – there is plenty out there -  on moving catalogs between systems. Seriously, where is the fun in that?

I still had the original .stc file lying around so I thought I would see if I could use PROC CIMPORT with SAS On demand.

I had read-only access to the SAS server because that is what the students have, and I wanted to replicate what a student could or could not do.

I uploaded the .stc file using Filezilla (nothing special about Filezilla, I just happen to like it).

When I looked in the SAS library for my class, I did not see the file, even though all of the regular sas7bdat files showed up there, but I thought, what the hell, I’ll try the CIMPORT anyway.

Then, I ran this code:

libname mydata "/courses/univ.edu1/i_432400/a_2469/saslib" access=readonly;
Filename readit "/courses/univ.edu1/i_432400/a_2469/saslib/04219-0001-data.stc" ;
proc cimport infile = readit library = work  ;

 

The part in quotes in the LIBNAME statement is the library for my course on the SAS server. I got this error message:

ERROR: This transport file was created by an earlier release of SAS that did not record encoding.  If and only if you know this file was created from a SAS session running a UTF8 encoding, you may set the ISFILEUTF8 CIMPORT option to T (for TRUE) to tell us to proceed to import this file.  If it was not created by a UTF8 session, then you must import it in a session encoding that is compatible with the session encoding that produced that file.

If and only if I know this file was created running a UTF8 encoding ? Are you kidding me? Despite all the best warnings from my mother not to touch things if you don’t know where they’ve been, I downloaded this off the Internet! It could have originally been encoded in Korean for all I know (it wouldn’t be the first time).

But then I said to myself,

“Self, wait a minute here. If it is in fact UTF8 encoded and I say that is true, then it will work. And if it isn’t true, what’s the worst thing that could happen? A gnome with a candy-cane hat jumps out of my computer and slaps me? Not very damn likely.”

So, I added the option, ran the statement like this

proc cimport infile = readit library = work ISFILEUTF8=TRUE ;

My program ran, my SAS log showed all the formats were output, and when I ran analyses, my formats were there and used.

All was right with the world.
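If you would rather check than take the log’s word for it, one way to see which formats are sitting in the WORK library is the FMTLIB option of PROC FORMAT. The sketch below defines a throwaway format (yesno, a made-up name) just so it has something to list when run on its own. After a CIMPORT into work, the assumption is that the imported formats land in the same WORK.FORMATS catalog and would show up in the same listing.

```sas
/* Toy format so there is something in WORK.FORMATS to list */
PROC FORMAT ;
  VALUE yesno 0 = "No" 1 = "Yes" ;
RUN ;

/* FMTLIB prints the contents of every format in the catalog */
PROC FORMAT LIBRARY = work FMTLIB ;
RUN ;
```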

Just to make matters more complicated, let me throw in this  — if you had NOT made the library ACCESS = READONLY and tried to write to your class’s SAS library you would get a DIFFERENT error about the formats coming from a different system and it would not have worked.

My guess, and it is just a random guess, is that maybe the class library had formats that were inconsistent with the older ones being uploaded so there was a conflict, while, since the work library is just temporary storage for the session, there was nothing there to be inconsistent with. I could look it up or Google it or something. But I didn’t.

Yes, it IS amazing that they give people with my level of maturity passwords to high performance computers. I find it hard to believe myself.

 

 

 

 

When I met my second husband, the entire contents of his kitchen cupboards were a carton of Marlboro cigarettes, a bottle of Jack Daniels and a loaded Magnum revolver. Despite that, he was extremely healthy. His reasoning was that he was a big, tough guy while viruses and bacteria were little bitty things, so the nicotine and alcohol killed them before they could get around to making him sick.

In my continuing saga of mucking about with open data, I’ve been looking at the Kaiser Permanente data from the study of the oldest old. I’m interested in testing out the claims of a friend of mine who recently turned 65. He says that he may as well keep up drinking 8 or 9 beers a night because it’s too late to quit now. He’s already too old to die young.

The original data set had over 40% of the data missing on this variable but I noticed that people who said they did not drink were not asked this question. So, I added a couple of IF statements that set the number of drinks per day to zero for people who said they did not drink. This left me with four categories.

A nifty little note if you are using SAS Enterprise Guide: if you want your data to be presented in the table according to the formatted values, rather than the unformatted values, look to the right of the task window for summary tables and you’ll see an option for what to sort by and in what order.

This was useful for me because the default is to sort by unformatted values and the unformatted values were

  • 0 = never,
  • 1 = 6 or more,
  • 2 = 3-5 drinks per day and
  • 3 = 2 or fewer drinks per day.

The results can be seen in the table below. In fact, among those who answered, people who did not drink at all, at least in the past year, had the longest life span, at 85.4 years. People who drank two or fewer drinks per day had a life expectancy of 84.8, those who had 3-5 alcoholic drinks per day lived an average of 81.9 years and those who drank six or more did have the lowest life span, a still pretty old 81.7 years.

It’s worth noting that almost 20% of respondents had no data for this question and they had the longest lifespan of all at 85.8 years. I don’t know what to make of that other than that possibly those who tell researchers it is none of their damned business how much they drink have less repressed anger and stress and consequently live longer.

Table showing average lifespan by number of drinks per day

 

Were these differences significant? My next analysis was an ANOVA with lifespan as the dependent variable and the four alcohol consumption groups as the independent variable. I dropped the people with missing data.

There was a significant difference (F = 19.55, p < .0001). However, keep in mind that this is a fairly large sample, meaning that even relatively minor differences will be statistically significant. In fact, drinking only explained about 2% of the variance in lifespan, and that’s assuming that there aren’t any confounding factors like women drinking less and also having greater longevity.
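For anyone who would rather type code than click through a task, a one-way ANOVA like this could be sketched roughly as below. This is not the code Enterprise Guide generated, and the twelve records are invented so the step runs on its own. The variable names amntalc and lifespan are the ones from my data step; the data set name work.anovademo is hypothetical.

```sas
/* Made-up pairs of (drinking category, lifespan), three per group */
DATA work.anovademo ;
  INPUT amntalc lifespan @@ ;
  DATALINES ;
0 86 0 85 0 88 1 80 1 82 1 79 2 81 2 83 2 80 3 85 3 84 3 87
;
RUN ;

PROC GLM DATA = work.anovademo ;
  CLASS amntalc ;                 /* treat the codes as groups, not as a number */
  MODEL lifespan = amntalc ;      /* overall F test and R-square */
  MEANS amntalc / TUKEY ;         /* post hoc pairwise comparisons */
RUN ;
QUIT ;
```

Observations with a missing value on either variable are left out of the model automatically, and the R-square in the output is the share of variance explained.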

Another nifty tip for SAS Enterprise Guide. If you would like to export a graph in a project, say, so that you could post it on your blog, all you need to do is right-click on that graph and select SAVE PICTURE AS

When we look at the box plot produced by the ANOVA, it’s evident that there is very little difference between the last two groups.

We knew that from the table of means above, anyway. Perhaps more interesting is the fact that a Tukey post hoc test showed all groups to be significantly different from one another EXCEPT for the last two groups.

In other words, for my friend who drinks 8-9 beers a night to see any increase in life expectancy he would have to drop his alcohol consumption to less than 25% of his current level. Simply cutting it in half would not really yield him a significant benefit.

Of course, I do suggest that from time to time and his response is:

Look, you’re not my mother. If I have any need of a woman telling me what to do, I already get it fulfilled when I visit my mother three times every week. She’s  88 years old.

I wonder how much she drinks?

 

 

In my analysis of data on the oldest old from the Kaiser Permanente study, I started with cigar smoking because I have a friend who turned 65 last month. His annual physical went like this:

Doctor: Do you still smoke cigars?

Jim: Yep.

Doctor: Do you still drink 8 or 9 beers every night?

Jim: Yep.

Doctor: Are you going to quit?

Jim: Nope.

Doctor: Well, see you next year.

You know how old people always tell you that it doesn’t matter if they quit now, because they’re too old to die young anyway? I got to wondering if that was true, if people who were smoking and still alive at age 65 would be any more likely to die. (We’ll save the 8 or 9 beers thing for later.)

My first analysis was to look at the data by gender because I suspected that men were more likely to smoke cigars and I knew that women have  a longer life expectancy. When I did a table analysis by gender and cigar smoking using SAS Enterprise Guide, I found that only one of the 288 cigar smokers in the sample was female. So, it was clear to me if I was looking at the impact of cigar smoking on mortality I should probably only look at men.

The average age at death for men who smoked cigars (N=175) was 84.2 years, compared to 84.2 years for men who did not smoke cigars (N=1,147). Since there was some rounding here, p does not equal exactly 1.0 but rather p > .90.  So, it does not seem that quitting cigar smoking would improve your life expectancy given you have already made it to age 65.

However, that is only people who died. Perhaps not smoking will increase your probability of survival? Let’s do a simple chi-square analysis comparing people who smoked cigars to those who did not, to see whether the probability of dying over the nine years they were followed differs.

Of the cigar smokers, 13.2% died during the nine years participants were followed, compared to 13.1% of men who did not smoke cigars (chi-square = .07, p > .78).
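In code, a chi-square test like this is a PROC FREQ with the CHISQ option. The sketch below uses invented cell counts, not the study’s, and a hypothetical 0/1 variable named died. It reads the 2 x 2 table as four rows of counts and uses a WEIGHT statement so each row stands for that many people.

```sas
/* Made-up counts: smcigars (1 = yes), died (1 = yes), number of men */
DATA work.chidemo ;
  INPUT smcigars died count ;
  DATALINES ;
1 1 20
1 0 130
0 1 150
0 0 1000
;
RUN ;

PROC FREQ DATA = work.chidemo ;
  WEIGHT count ;                        /* each row represents count people */
  TABLES smcigars * died / CHISQ ;      /* row percents plus the chi-square test */
RUN ;
```

With one record per person, as in the real file, you would leave out the WEIGHT statement.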

Still, I am not convinced. Maybe cigar smokers were older, or less likely to be overweight or drink (well, I doubt that one). Anyway, being the good little statistician, I ran a logistic regression with death as the dependent variable and BMI, education, age at the beginning of the study (baseline), whether or not the subject had ever drunk alcohol and whether or not he smoked cigars as the predictors.

Table of Type 3 effects

As you can see, the only two significant factors were age (older people were more likely to die in the next nine years – not surprisingly since the study began with people 65 and over) and your education. I’m guessing education is a proxy for social class.
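A model along these lines could be sketched in PROC LOGISTIC as follows. To keep it runnable on its own, the data are simulated, with death made to depend on age only, so none of the numbers mean anything. The names died, bmi, baseage and everdrank are hypothetical stand-ins; only education and smcigars are names from my data step.

```sas
/* Simulated men: death depends on baseline age and nothing else */
DATA work.logitdemo ;
  CALL STREAMINIT(2012) ;
  DO id = 1 TO 500 ;
    baseage   = 65 + FLOOR(25 * RAND("UNIFORM")) ;
    bmi       = 20 + 10 * RAND("UNIFORM") ;
    education = 1 + FLOOR(4 * RAND("UNIFORM")) ;
    everdrank = RAND("BERNOULLI", 0.7) ;
    smcigars  = RAND("BERNOULLI", 0.15) ;
    p_death   = 1 / (1 + EXP(-(-12 + 0.15 * baseage))) ;
    died      = RAND("BERNOULLI", p_death) ;
    OUTPUT ;
  END ;
  DROP p_death ;
RUN ;

PROC LOGISTIC DATA = work.logitdemo ;
  CLASS education smcigars everdrank / PARAM = GLM ;   /* categorical predictors */
  MODEL died (EVENT = "1") = bmi education baseage everdrank smcigars ;
RUN ;
```

Because the model has CLASS variables, the output includes a Type 3 analysis of effects table, which is the kind of table shown above.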

So, it looks as if Jim is okay to keep smoking his cigars, given he has made it this far. I’m still not convinced about the 8 or 9 beers, though. That’s my next analysis.

 

(Yes, that title does sound like a lot of the spam comments I get. )

Last year, at the Gov. 2.0 conference in Santa Monica, Jeanne Holm, from NASA, spoke about some of the opportunities for open data. I left with mixed feelings. On the one hand, the best examples she gave were, I thought, of “semi-open” data, a term I just made up for having more openness of data within an organization. In one example, there would be a database of the capabilities of researchers within NASA, so if I was a NASA electrical engineer and I had an idea for designing a better electrical system for a lunar module, I could find out who had related expertise in hardware design, reliability testing, etc. That is a great idea, but it also makes me wonder to what extent open data within an organization would be put to use. It depends a lot on the organization. Many large institutions – whether corporate, government, non-profit or educational – are not very excited about people going around the usual chain of command or departmental structure, no matter how many times they chant, “Think outside the box!”

Many times, both on this blog and elsewhere, I’ve questioned the probability of useful discoveries coming from open data unless the individuals doing the analysis have some knowledge of statistics, programming and the structure of the data actually being used. If we have ten thousand people each doing 100 analyses, and 100,000 of the results are statistically significant, how do we decide which half of those is important and which half is the 50,000 statistically significant results we would expect to occur by chance (at p < .05) with 1,000,000 analyses?

Ten thousand people doing 100 analyses = 1,000,000 results. One of those will have a p-value of .000001 just by chance. A one in a million coincidence happens one time out of a million, right? So, let’s say we get three of those p < .000001 events. How do we know which ones matter?
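The arithmetic behind those numbers is short enough to check in a DATA step. This is a back-of-the-envelope sketch that assumes no real effect exists in any of the analyses, and, for the last line only, that the analyses are independent of one another. All the variable names are made up.

```sas
DATA _NULL_ ;
  n_tests = 10000 * 100 ;                  /* 10,000 people x 100 analyses each */
  exp_05  = n_tests * 0.05 ;               /* expected chance results at p < .05 */
  exp_1m  = n_tests * 0.000001 ;           /* expected chance results at p < .000001 */
  /* chance of at least one p < .000001, assuming independent tests */
  p_any   = 1 - (1 - 0.000001) ** n_tests ;
  PUT n_tests= exp_05= exp_1m= p_any= ;    /* results are written to the log */
RUN ;
```

Under those assumptions you expect 50,000 chance findings at the .05 level and about one at the one-in-a-million level, with roughly a 63% chance that at least one of the latter turns up.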

 

She said, sure you can have people correlating everything and come up with nonsensical relationships like between number of flashlights sold and solar flare activity. Presumably, somewhere “out there” are scientists or consumers of data or someone who will be able to tell the real findings from the flashlight sales-solar flares relationships. Having read a lot of academic journal articles that appear to have been both written and edited by people who were either not paying attention, inebriated or both, I am not so optimistic.

So, I decided to do an experiment and see just how far I could get with some samples of open data. The first data set I chose was from the Kaiser-Permanente study of the oldest old. This is actually two data sets.

One of the reasons I chose these data is that they come with pretty comprehensive documentation. For example, after reading through several hundred pages, I knew that the first data set was the master file and the second was a hospitalization file. In my experiment to see if I could find anything useful in here at all (other than what had already been published), I decided to use just the master file.

A second reason for selecting the oldest old study is that there are some published statistics I can verify my results against to see whether I have the data read in and coded correctly from the beginning. For example, the number of deaths I had in the first cohort (1,565) and second cohort (1,751) matched their figures exactly.

I did not start out with any preconceived notions other than the general public’s assumptions, like the older people were at the beginning of the study, the more likely they were to die before it was over. While I’ve worked on some health statistics studies in the past, I am not a physician. This is one reason I used the master file instead of the hospitalization one. I know what acute myocardial infarction is, but I could not really generate much in the way of hypotheses about it or about the accuracy of its diagnosis. On the other hand, dead or not dead is pretty objectively measured.

I started with cigar smoking because I have a friend who turned 65 last month. His annual physical went like this:

Doctor: Do you still smoke cigars?

Jim: Yep.

Doctor: Do you still drink 8 or 9 beers every night?

Jim: Yep.

Doctor: Are you going to quit?

Jim: Nope.

Doctor: Well, see you next year.

You know how old people always tell you that it doesn’t matter if they quit now, because they’re too old to die young anyway? I got to wondering if that was true, if people who were smoking and still alive at age 65 would be any more likely to die. (We’ll save the 8 or 9 beers thing for later.)

My first analysis was to look at the data by gender because I suspected that men were more likely to smoke cigars and I knew that women have a longer life expectancy. When I did a table analysis by gender and cigar smoking using SAS Enterprise Guide, I found that only one of the 288 cigar smokers in the sample was female. So, it was clear to me that if I was looking at the impact of cigar smoking on mortality, I should probably only look at men.

The average age at death for men who smoked cigars (N=175) was 84.2 years, compared to 84.2 years for men who did not smoke cigars (N=1,147). Since there was some rounding here, p does not equal exactly 1.0 but rather p > .90.  So, it does not seem that quitting cigar smoking would improve your life expectancy given you have already made it to age 65.

However, that is only people who died. Perhaps not smoking will increase your probability of survival? Let’s just take a simple chi-square analysis comparing people who smoked cigars to those who did not and seeing what is the probability they died over the nine years they were followed.

Of the cigar smokers, 13.2% died during the nine years participants were followed, compared to 13.1% of men who did not smoke cigars (chi-square = .07, p > .78).
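If you want to see what a chi-square test like this does under the hood, here is a rough sketch in Python for a 2x2 table. I ran the real analysis in SAS, not with this code, and the counts in the example are invented purely for illustration; they are NOT the Kaiser Permanente numbers.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the upper tail of the chi-square
    # distribution is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# HYPOTHETICAL counts: rows = cigar smoker yes/no, columns = died/survived
stat, p = chi_square_2x2(40, 260, 130, 870)
print(round(stat, 3), round(p, 3))
```

When the two rows have nearly the same proportion who died, the statistic sits close to zero and the p-value close to 1, which is the pattern in the results above.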

Still, I am not convinced. Maybe cigar smokers were older, or less likely to be overweight or drink (well, I doubt that one). Anyway, being the good little statistician, I ran a logistic regression with death as the dependent variable and BMI, education, age at the beginning of the study (baseline), whether or not the subject had ever drunk alcohol and whether or not he smoked cigars as the predictors.

Table of Type 3 effects

As you can see, the only two significant factors were age (older people were more likely to die in the next nine years, not surprisingly, since the study began with people 65 and over) and education. I’m guessing education is a proxy for social class.

So, it looks as if Jim is okay to keep smoking his cigars, given he has made it this far. I’m still not convinced about the 8 or 9 beers, though. That’s my next analysis.

 

Before the semester began, I debated about requiring SAS On-demand for my statistics course. In fact, after giving it some thought, I decided to make its use optional rather than mandatory. One reason for my hesitation was uncertainty about basing a major part of students’ grades on a project requiring an untested software package. I could see the possibility for disaster. Although it took a good bit of my time to prepare, that was NOT a major issue for me. When I was a full-time faculty member I was constantly frustrated by feeling I did not have enough time to do the best possible job for each of my classes and students. Now, by choice, I teach only once or twice a year, at most.

No, my other concern was that I might be requiring too much of the students. Many of them had never had a statistics course before this class. Out of 19 students, at most two or three of them had as much as one semester of Calculus. I breezed through descriptive statistics, covered correlation, ANOVA, multiple regression and logistic regression in depth, and touched on mixed models, survival analysis, factor analysis and a tiny bit of structural equation modeling and hierarchical models. On TOP of all of that, they were going to have to learn at least enough about SAS to run analyses on actual data and give a conference-type presentation. It has been over a quarter of a century since I took my first graduate course in statistics. (That was back when people went to graduate school in their twenties and that was all they did. I know that seems quaint to you all today.) Maybe this is going to make me sound like one of those old fogeys who claim to have walked to school in the snow, uphill both ways.

Still, the truth is that graduate school has become watered down over the past few decades. Professors are supposed to understand that students “have to work” and are not expected to give as much in the way of assignments so as not to unduly burden the paying customers – er, students. Students at many schools are either subtly or openly encouraged to “go hire a statistician” to help with their dissertations.

Honestly and truly, when I was in graduate school, I did not even have A CALCULATOR to do the homework problems in statistics because calculators were very expensive and I started graduate school with a preschooler and had two more babies my first two years. As my advisor grumpily said to me,

“Listen, I’m Catholic, too, but there’s such a thing as taking it to extremes.”

I was an industrial engineer and programming with SAS before I went back for a Ph.D. so I actually would telnet (remember telnet?) into the university server and run SAS programs to check my statistics homework problems, because graduate students got X number of hours on the computer for free (remember PAYING for computer time?)

I thought about it for a while and concluded,

“Screw it! These are doctoral students at a selective university. They’re getting a doctorate, and they’re smart. They ought to be able to do the work and they WILL learn something and get their money’s worth out of this course, whether they want to or not.”

As I said, this experiment could have turned out to be a disaster in many ways. The software could have not worked. The students could have complained to the administration about the work load. The administration could have told me I was being unreasonable.

It turned out that the software did require some advance preparation and extra work during the course. I did a lot of pre-processing of open data sets myself. However, by the third week of the course, the students had split into five groups for their research projects and every group had at least one person with a computer with SAS On-demand installed. Four out of the five groups ended up using SAS On-demand for their research. I strongly encouraged them to submit their research for presentation at the Western Users of SAS Software conference in September or the Los Angeles Basin SAS Users Group this summer.

It also turned out that the students really DID want to get their money’s worth out of their courses. I heard from several of them off the record, and let me just say that really bright students know whether they are being challenged or just passed through with their tuition checks cashed, and they appreciated the former. This may well be because they are all working adults who could see the possibility of applying SAS skills and statistical analysis in a work setting, and who also see the competitive environment for employment right now.

The administration seems happy enough. I still get my checks direct deposited and still get invited to departmental parties, so I guess that is a good sign.

It WAS a happy adventure and the main reason, as I stated in an earlier post, is the kinds of research my students were able to do.

Let me just give you one example of what came out of this semester:

One group was interested in testing the hypothesis:

Are African-American women less likely to get married if they have more education?

The group members (three women, two of them African-American) thought that the answer was, “Yes”. Their first reason was that they thought some men might be intimidated by women with more education, and that statistics showed African-American women were obtaining degrees at a higher rate than men. Also, they thought women who had degrees might be less willing to “settle”, that they wouldn’t feel like they had to get married, so would be more likely to stay single.

My husband, the real-live rocket scientist (now retired as of a few weeks ago), disagreed vehemently when I told him about this. (Here is evidence you are doing interesting research: your professor discusses it at home and it leads to debates.) He said that the women in the group were looking at it only from a female perspective, that women with more education may feel less need to get married. As a man, he wanted a woman with an education, someone who was not boring and not looking at him as a meal ticket. He thought they failed to consider the male perspective, which is that more men might WANT to marry them if they had more education.

So…. how did it all come out?

Bar chart of married mean by education

As you can see, he was right. Of the women aged 18 – 65 years, 43.9% of those with a graduate degree were married versus 39% of those with less than a high school diploma.

Age is a confound here, however. Education has been rising for African-American women over the past few decades, so older women are more likely to be married (having had more years to get married) and less likely to have obtained higher education.

So, the group conducted a logistic regression with marriage as the dependent variable and education, age and wages earned in the past year as the independent variables. They found that education was still a positive factor in predicting marriage after controlling for age and income.

The students used education as both a categorical variable and a continuous variable and found the same result.

Because this made me curious, I re-analyzed the data several ways, using women from age 16 and up, then age 18 and older. (Seriously, this is California. Who gets married at 16?) I looked both at currently married (yes/no) and ever married (yes/no), and education was still significantly (p < .0001) positively related to marriage. Earned income also consistently showed a positive relationship with the probability of marriage.

So, we have scientific evidence – men like smart women. Successful ones, too.

This is just one example of four groups that presented using SAS On-demand. I’ll try to get around to discussing others later.

Hopefully, you’ll see them at WUSS. Hey, if you’re one of those men seeking smart women, you should look them up – I know at least two of them are still single!

 

It is not every day that my refrigerator provides insight into a statistical problem. My daughter gave me this magnet.

BIRTHDAYS ARE GOOD FOR YOU. STATISTICS SHOW THAT PEOPLE WHO HAVE THE MOST BIRTHDAYS LIVE THE LONGEST.

which led to my thoughts on life expectancy using open data. Kaiser Permanente collected data on two cohorts of patients, those who were 65 years old or older in 1971 and in 1980. After having published some research supported by the National Institute on Aging, on topics of interest to themselves and their funding agency, like cardiovascular disease, the investigators made their data available through ICPSR, where I downloaded them.

I had read elsewhere that the life expectancy had increased even within that decade. If that is true, I reasoned, looking at the survival curves for people from the 1971 cohort (1) and the 1980 cohort (2) would show some differences.

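When I look at the survival curves by strata:

PROC LIFETEST DATA = saslib.death ;
  /* One survival curve per cohort (1 = 1971, 2 = 1980) */
  STRATA cohort ;
  /* yrslived is the time variable; dthflag = 0 marks censored cases */
  TIME yrslived * dthflag(0) ;
RUN ;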

I get the survival curves above and it is pretty clear they are the same. If anything, it looks like Cohort 2, those born later, actually had a slightly higher mortality near the end of the study. For those of you who feel uncomfortable just eyeballing the curves, even when they are as close to identical as this, the Log-Rank test for equality of strata gives a chi-square of 1.27 (p > .25).

Real hand soaps

And yet, on the other hand, when I did a t-test of age at death by cohort, I found that the 1980 cohort did live significantly longer: those who turned 65 in 1971 had a mean age at death of 84.7, while those who turned 65 in 1980 had a mean age at death of 85.3 (p = .01). That leads to the conclusion that people from the 1980 cohort did live longer.

What’s going on here? This is the point where the getting to know your data part that I am always harping on comes into play. Note that Kaiser-Permanente said that they collected data on people who were 65 or older  in 1971 or 1980, not who turned 65.

In fact, the two samples were not exactly the same age. The mean age of the 1971 sample was 75.7 and of the 1980 sample 76.1. So, of that .6 year difference in lifespan, .4 of it existed before the study even started.

What difference would that make? Well, let’s go back up to my refrigerator magnet. What does the fact that someone has lived to 65 tell you? Most unequivocally that they did not die of anything at an age earlier than 65. They weren’t killed in the Vietnam War when they were 19 years old, in a car crash with a drunk driver when they were 33, or by colon cancer at 56. Because 100% of the population of people who live to 65 have escaped these hazards for the first 65 years of life, they are NOT representative of the population in terms of life expectancy. This is why when you read articles they have statements like, “For an American male who has lived to age 65, life expectancy is ….”

The qualifying phrase is necessary because those who have had more birthdays already are expected to live longer than the general population.
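The refrigerator-magnet effect is easy to see with a toy example. The sketch below is Python, and the ages at death are completely made up for illustration (they are not from the Kaiser Permanente file); the function name is also mine.

```python
from statistics import mean


def mean_age_at_death(ages, survived_to=0):
    """Mean age at death among people who lived to at least `survived_to`."""
    survivors = [age for age in ages if age >= survived_to]
    return mean(survivors)


# HYPOTHETICAL ages at death for ten people
ages = [19, 33, 56, 68, 72, 79, 84, 88, 91, 95]

overall = mean_age_at_death(ages)        # everyone, early deaths included
after_65 = mean_age_at_death(ages, 65)   # only those who made it to 65

print(overall, round(after_65, 1))
```

Dropping the three people who died before 65 raises the mean age at death, even though nobody lived a day longer. The same thing happens when one sample simply starts out a little older than the other.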

So, I pulled out those who were 65 when the study started and looked at the survival curve

SURVIVAL CURVE FOR PATIENTS AGED 65

COHORT 1 (1971) AND COHORT 2 (1980)

Survival curve 65 at start of study

A t-test of years lived for those in Cohort 1 versus Cohort 2 using only the 290 subjects who were 65 years old at the start of the study produced a very non-significant t-value of .56 (p > .50).

T-tests for subjects at age 75 and age 85 produced similar results. So, based on these data, the answer to the question of whether patients of Kaiser Permanente, at least, increased in life expectancy over the 1970s is, “No”. This isn’t a comment on Kaiser Permanente one way or the other, merely an observation that it is unlikely that their patients are completely representative of the population.

Just an aside, a million points to people who put their data on the web and open to all comers. This shows two traits I admire. The first is generosity, allowing someone else to benefit from your efforts in collecting the data, with no expectation of return. The second is courage. It takes a good amount of courage to publish your results and then make your data available so anyone who wants can re-analyze the data and perhaps come up with a competing conclusion. So, props to you.

P.S. You can buy the hand soaps on etsy. I have no affiliation with them, I just thought they were funny.

 

Let the buyer beware – that phrase certainly applies to open data, as does the less historical but equally true statement that students always want to work with real data until they get some.

Lately, I have had students working with two different data sets that have led me to the inevitable conclusion that self-report data is another way of proving that people are big fat liars. One data set is on campus crime and, according to the data provided by personnel at college campuses, the modal number of crimes committed per year – rape, burglary, vehicle theft, intimidation, assault, vandalism – is zero. Having taught at a very wide variety of campuses, from tribal colleges to extremely expensive private universities, and never seen one that was 100% crime free, I suspected  – and this is a technical term here – that these data were complete bullshit. I looked up a couple of campuses that were in high crime areas and where I knew faculty or staff members who verified crime had occurred and been reported to the campus and city police. These institutions conveniently had not reported their data, which is morally preferable to lying through their teeth, with the added benefit of requiring less effort.

Jumping from there to a second study on school bullying, we found, as reported by school administrators in a national sample of thousands of public elementary, middle and secondary schools, that bullying and sexual harassment never, or almost never, occur and there are no schools in the country where these occur on a daily basis. Are you fucking kidding me? Have you never walked down the hall at a middle school or high school? Have you never attended a school? What the administrators thought to gain or avoid by lying on these surveys, I don’t know, but it certainly reduces the value of the data for, well, for anything.

So …. the students learn a valuable life lesson about not trusting their data too much. In fact, this may be the most valuable lesson they learn, Stamp’s Law:

The government are very keen on amassing statistics. They collect them, add them, raise them to the nth power, take the cube root and prepare wonderful diagrams. But you must never forget that every one of these figures comes in the first instance from the village watchman, who just puts down what he damn pleases.

From an analysis standpoint, this is my soapbox that I am ranting on every day. Before you do anything, do a reality check. If you use SAS Enterprise Guide, the Characterize Data task is good for this, but any statistical software, or any programming language, for that matter, will have options for descriptive statistics  – means, frequencies, standard deviations.
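To show what I mean by a reality check outside of SAS, here is a minimal sketch in Python using only the standard library. The function name and the numbers are my own inventions for illustration; the list stands in for something like crimes reported per campus and is not real data.

```python
from collections import Counter
from statistics import mean, stdev


def reality_check(values):
    """Basic descriptive statistics for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "mean": mean(present),
        "sd": stdev(present),
        "mode": counts.most_common(1)[0][0],
        "share_zero": counts[0] / len(present),
    }


# HYPOTHETICAL reported crimes per campus; None = campus did not report
reported = [0, 0, 0, None, 0, 2, 0, None, 0, 15]

summary = reality_check(reported)
print(summary)
```

A mode of zero, a large share of zeros and a pile of missing values are exactly the kind of thing that should make you stop and ask whether the numbers can possibly be true before you run anything fancier.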

This isn’t to say that all open data sucks. On the contrary, I’m working with two other data sets at the moment that are brilliant. One used abstracts of medical records data over nine years plus state vital records to record data on medical care, diagnoses and death for patients over 65. Since Medicare doesn’t pay unless you have data on the care provided and diagnosis, and the state is kind of insistent on recording deaths, these data are beautifully complete and probably pretty accurate.

I’ve also been working with the TIMSS data. You can argue all you want about teaching to the test, but it’s not debatable that the TIMSS data is pretty complete and passes the reality test with flying colors. Distributions by race, gender, income and every other variable you can imagine are pretty much what you would expect based on U.S. Census data.

So, I am not saying open data = bad. What I am saying is let the buyer beware, and run some statistics to verify the data quality before you start believing your numbers too much.
