I'm on my way home from the 2014 Western Users of SAS Software conference. When I was younger, I would go to every basic session, trying to find something I could use that wasn't over my head. As I got older, I went to the statistics sessions to see if there was anything new or more advanced I had not mastered yet.

Now that I’m really old, I just do my own presentations and then spend the rest of the conference wandering around to anything that looks interesting. Sometimes, the most interesting stuff is the questions after a session or just the random people I run into in the hallways.

Interesting stuff, part 1: Data coolness

I had used the California Health Interview Survey as example data for classes, but I was not aware of the huge breadth of data available there. Also, if you are a researcher and ask nicely, they will create data sets for you, as long as the data are available and it can be done without violating confidentiality requirements. Check them out here.

http://data.ca.gov/

Say you wanted to chart the number of amputations per 100,000 workers over the past six years. The state of California has you covered.

[Chart: amputations per 100,000 workers]

That was pretty random, yes? Want pneumoconiosis hospitalizations? Just check it out if you ever need health data, death data, politics – anyway, a good resource.

Interesting stuff, parts 2, 3 & 4 (which I hope to write about this week)

Another random idea that I have certainly had before but never implemented … eek, I have to go, but remind me: it has to do with getting around the SAS Studio limit on ginormous data.

Also, F-tests, p-values and r-square.

And permutations, random data, bootstrapping and creating your own version of F-tests, t-values and p-values

As I mentioned yesterday, banging away at 7 Generation Games has led to less time for blogging and a whole pile of half-written posts shoved into cubbyholes of my brain. So, today, I reached into the random file and coincidentally came out with a second post on open data …


The question for Day 11 of the 20-day blogging challenge was,

“What is one website that you can’t do without? Tell about your favorite features and how you use it in teaching.”

Well, I'm a big open data fan and a big believer in using real data for teaching. I couldn't limit it to one. Here are four sites that I find super-helpful.

The Inter-university Consortium for Political and Social Research has been a favorite of mine for a long time.  From their site,

“An international consortium of more than 700 academic institutions and research organizations, ICPSR provides leadership and training in data access, curation, and methods of analysis for the social science research community.

ICPSR maintains a data archive of more than 500,000 files of research in the social sciences.”

I like ICPSR but it is often a little outdated. Generally, researchers don’t hand over their data to someone else to analyze until they have used it as much as their interest (or funding) allows. On the other hand, it comes with good codebooks and often a bibliography of published research. As such, it’s great for students learning statistics and research methods, particularly in the social sciences.

For newer data, my three favorites are the U.S. census site, CHIS and CDC.

The census.gov data resources section is enough to make you drool when it comes to open data. They have everything from data visualization tools to enormous data files. Whether you are teaching statistics, research methods, economics or political science, and whether you're teaching middle school or graduate school, you can find resources here.

 

Yes, that's nice, but what if you are teaching courses in health care – biostatistics, nursing, epidemiology? Whatever your flavor of health-related interests, and whether you want your data and statistics in any form from raw data to publication, the Centers for Disease Control and Prevention Data & Statistics section is your answer.

Last only because it is more limited in scope is the California Health Interview Survey site where you can get public use files to download for analysis (my main use) as well as get pre-digested health statistics.

It all makes me look forward to diving back into teaching data mining this fall.

The Mplus website has free videos on using Mplus. The introductory course (not very introductory, really, because the main topics are advanced factor analysis and structural equation modeling) has some pretty good resources. While it is not professional quality – just a video of a class – and much of it is beyond the first or second statistics course most students take, about 22 minutes into the first course there is a very nice discussion of regression and an example with a dummy variable. I think I'll use that part for my next class.

There is also a good introductory discussion of path analysis in the same video, at about 27 minutes. The reason I'm giving the time points in the video is that it is VERY long and much of it is either kind of irrelevant – why Mplus came about – or too advanced for introductory graduate courses.

The Inter-university Consortium for Political and Social Research has a wealth of resources. The one I have used the most in the past is data that you can download. However, you can also select a study and perform simple cross-tabulations. For example, I used the 500 Family Study to create a table of the frequency with which a respondent attended religious services with a child by the family’s religion, controlling for relationship of the respondent. I was assuming mothers would attend services with their children more often than fathers. I had no idea whether there were differences among religions – well, it turns out there are pretty dramatic differences.

You can use ICPSR not only to statistically answer questions, but also to look at the statistical methods themselves. For example, I did a cross-tabulation of age and sex using the Kaiser-Permanente study of the oldest old.  Each analysis produces both a table and a graph and here is the graph this one produced.

 

[Graph: number at each age by gender]

Now, we know that women live longer than men, but in this study we see there isn't a real trend of the percentage of males going down as the age moves from 65 years on the left to 94 years on the right. In fact, 100% of the subjects over 92 years old were male. What does this tell you? Well, hopefully, after some discussion it will tell your students that this was not a random sample. In fact, it looks more like the researchers deliberately sampled to get close to equal numbers of males and females, possibly within an age range, such as 80 to 90. It also suggests that there probably weren't too many people over 92 years old in the sample; that finding that 100% of this age group is male was probably based on three or four people. Having generated some hypotheses, you can go back to the ICPSR site and test them: requesting percentages for the row and total lets you know that 0.1% of the sample was over 92 years of age. Speaking of hypotheses, you can also get chi-square values, z-values and confidence intervals for your various research questions.

I like this site because, unlike a lot of others that have pre-packaged and usually boring questions (“Will males choose red M&M's more often than females?”), ICPSR has hundreds of studies with hundreds of thousands of variables, allowing students to pose questions that really interest them and find an answer.

In my copious spare time, of which I have none, I teach in the doctoral program at a nearby university.  They want me to use the library and keep up on research, both because it looks nice in the alumni newsletter and also so that when students ask me questions about current technologies or findings, I don’t shrug and say, “Your guess is as good as mine.”

That sort of thing makes students wonder whether  getting a PhD is really worth going into debt for the remainder of their lives.

So, the university kindly pays money to a whole bunch of different publishers just so usually ungrateful people like me can read their journals.

The authors of the research would also like people like me to use their research. They’d like to be cited, because that helps make them look good to funding agencies and tenure review committees. They’d like to think that their Uncle Bob was wrong, that they are not wasting their lives studying something as practical as how many angels can dance on the head of a pin, and that people who are actually teaching school or designing products will use their work to make the world a better place.

What’s the problem? The problem is that between the authors of the research, who probably did not get paid, and the university library, which paid for access, there are a number of barriers thrown up by publishers. Here is what happened yesterday:

  1. Log into my campus account
  2. Go to the library web page and search ejournals for the articles I need
  3. Find article, click link to go to year
  4. Click link to go to issue
  5. Click article
  6. Get taken to publisher page
  7. Get asked to log in with campus ID again
  8. Read half of the article – get called away for a meeting
  9. Come back to find out I have been logged out due to being away from the computer. Go through steps 1-7 again.
  10. After answering a couple of calls from clients and students, find I have been logged out due to inactivity. Go through steps 1-7 again.
  11. Finish first article, go to second article, which is published by different publisher
  12. Find out that even though I have a university ID that I have now logged in with twice (not counting the two previous times I was logged out), I need to register for an account with this publisher and log into that
  13. Register, log in, read 20 pages. Eat lunch. Come back to find I have been logged out and now need to log into the campus account, go to the library web page, go back to the article and log into the publisher account AGAIN

There was more, but you get the idea. If I can, I download the resource to my computer, but often the number of pages I can download is limited. The crazy thing is that all of this is required of someone who has paid access to the articles.

DIRECTORY OF OPEN ACCESS JOURNALS TO THE RESCUE!

I recalled reading a draft of an article my brother had written and when I told him I’d rather blog because then people could at least read it, he mentioned he was publishing it in an open access journal.

My first stop was the Directory of Open Access Journals and now I am in love.

It was amazing. First of all, they had one journal that had lots of articles that were exactly what I wanted. The Journal of Research in Rural Education, if you are wondering. I got the wild impression from this journal’s website that they actually wanted me to be able to read the articles. Here is the unbelievably crazy thing that happened. I was able to search on the terms I was interested in, 150 results were returned, and when I clicked on a link — IT OPENED WITH THE ARTICLE.


When I went to eat dinner and came back, an amazing feat of technological innovation had occurred – THE ARTICLE I HAD BEEN READING WAS STILL THERE! Apparently, unlike the other publishers, the folks at JRRE are not concerned that a band of roaming article sneak thieves prowling the rough neighborhood of Ocean Park will break into my office while I am having my jambalaya and read research on mathematics education in rural contexts without paying $9 per article.

The effect these overly zealous firewalls have had on me personally is a definite preference for anything that is open access. I did request a few articles via interlibrary loan and I'll probably order a book or two. Given that the university is less than 10 miles away, it's faster for me to pick up the material in bulk in print than to read it online – which is just nuts!

There are a few articles and books I wanted because they were very specifically related to the work I am doing. However, for 90% of it, one article on fidelity of implementation measures is as good as another. So, the result will be that the work in the open access journals will get used and cited and the rest will not. I suspect we are seeing the beginning of a trend here.

 

(If you are too young to remember the song, “Money for nothing and your chicks for free”, I guess the title of this post is not nearly as amusing to you as it was to me, as my lovely daughter so unhelpfully pointed out.)

Since I whined yesterday about Codecademy not providing much explanation of the code in the Quickstart course (where not much is defined as none at all), I thought I should not be so hypocritical.

I am often posting some code and saying I will explain it later. I noticed some of those promises from 2009 or 2010. Well, 2012 is definitely later. I think I'll work backward, though, and start with what I promised to explain “later” this week.

As I’ve rambled on here a lot, open data is a great idea, but it takes some work. I decided to post some of what I’ve been doing here with explanation. In part, this was motivated by a talk I had today with some researchers I’ll be working with over the next few months. Someone said, quite correctly,

“You read the journal article and they say they did propensity score matching, but they never tell you exactly how they did it, how they modified the code, which macro they used, because that's not really the focus of the article. Unfortunately, when you go to replicate that model, you can't, because there isn't enough detail.”

So, here in great detail from beginning to end is how I banged the data into shape for the analyses of the Kaiser-Permanente Study of the Oldest Old. These are data I downloaded from the Inter-university Consortium for Political and Social Research. (Free, open data).

The analysis was all done using SAS On-Demand for Academics.

LIBNAME mydata "/courses/u_2.edu1/i_123/c_2469/saslib" ;

The LIBNAME statement specifies the directory for my course on the SAS server. This is going to be unique to your course; if you have a SAS On-Demand account and you are the instructor, you know what this directory is. The “mydata” is a libref – that is, it is just used in the program to refer to the library, a.k.a. directory. You can use any valid SAS name. Actually, I am not using anything in the class directory in this example, but I put this line first in every program as a habit, so that when I DO need the data in the directory available, I have it.


OPTIONS NOFMTERR ;

This prevents the program from stopping when it cannot find the specified format. SAS Enterprise Guide is generally pretty forgiving about format errors, and I used a .stc file which should have the formats included, but I usually include this option anyway. If there are missing formats, you’ll get notes in your SAS log but your program will still run.


FILENAME readit "/courses/u_2.edu1/i_123/c_2469/saslib/04219-0001-data.stc" ;
PROC CIMPORT INFILE = readit LIBRARY = work ISFILEUTF8=TRUE ;

This imports the file, complete with formats. I rambled on about this in an earlier post. In short, because this particular file was created on a different system, you can EITHER have the formats OR have it in your permanent (i.e. course) directory, but not both. Click on the previous link if you need detail on .stc files or CIMPORT. Otherwise, move on.


PROC FORMAT ;
VALUE alc
3 = "1-2"
2 = "3-5"
1 = "6+" ;

This creates a format for one of the variables in the data set. As you can see, the variable was coded 1 = 6 or more drinks a day, 2 = 3-5 drinks a day, and so on. I want to apply a format so the actual values print out, instead of “1” for the people who had six or more drinks.


DATA work.alcohol ;
SET work.DA4219P1 ;

The previous CIMPORT step created the dataset DA4219P1. I am reading it into a new data set, named alcohol, where I'm going to make changes and create variables for my final dataset to analyze. Everything I am doing in this program COULD be done with tasks in SAS Enterprise Guide, but I found it more efficient to do it this way. Both data sets are in the temporary work library.

ATTRIB b LENGTH = $10 c1dthyr LENGTH = 8 c2dthyr LENGTH = 8 ;

In the ATTRIB statement, I am defining new variables and specifying the length and type. You don’t have to do this in a SAS program but if you don’t, SAS will assign length and type when it first encounters the variables, and it may not do it exactly the way you want.


IF alcopyr = 0 OR alcohol = 0 THEN amntalc = 0 ;
FORMAT amntalc alc. ;

The variable amntalc was missing for a large number of people, but most of those people had previously answered “no” to the questions asking if they drank alcohol in the previous year or ever in their life. If they said, “no” to either of those, I set the amntalc, which is how much they drink per day, to zero. This dramatically cut the amount of missing data. Then I applied the format.

death = INPUT(COMPRESS(dthdate),MMDDYY10. ) ;
b = COMPRESS(brthmo) || "/01/" || COMPRESS(brthyr) ;
bd = COMPRESS(b) ;
brthdate = INPUT(b,MMDDYY10.) ;
lifespan = (death - brthdate) / 365.25 ;
lifespan = ROUND(lifespan,1) ;

 

All the above calculates two variables I actually care about and a couple of others I'm going to drop. Death is the date of death. Lifespan is how old the person lived to be. First, I read in the date of death, which had been stored as a text field; that's what the INPUT function does, and the MMDDYY10. informat tells it the form in which to read the data. I stripped out the blanks (that's what the COMPRESS function does). Now I have the date of death.

The file doesn't have actual birth days, just birth month and year. So b is the birth month, a day of “01” and the year – assigning everyone a birthdate of the first day of the month in which they were born. The statement with brthdate reads that as a SAS date. Then I subtract the birthdate from the death date and divide by 365.25 to get how many years the person lived. Finally, I ROUND the result to the nearest year.

Yes, I could have combined a lot of these statements into one. For example, I could easily combine the last two statements calculating lifespan. I did it like this because I use SAS On-Demand to teach and if my students peek at the program, which many do, it is easier for them to understand broken down like this.
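For the record, the combined version of those last two lifespan statements would look like this:

```sas
/* One-statement equivalent of the two lifespan lines above */
lifespan = ROUND((death - brthdate) / 365.25, 1) ;
```

Same result, just less transparent for a student reading the program line by line.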


if alcopyr = 1 then alcopyr = . ;
if smcigars = 1 then smcigars = . ;
if educ in (3,4,5) then education = 4 ;
else education = educ ;

The above recodes some variables. For a couple, “1” meant the data were not available, so I changed that to missing. For education, I combined three categories so my data were in the more typical categories of “less than high school”, “high school”, etc.


IF cohort = 1 then do ;
IF death = . then c1dthyr = 12 ;
ELSE  c1dthyr = YEAR(death) - 1971 ;
yrslived = c1dthyr ;
dthy1 = YEAR(death) ;
end ;


else IF cohort = 2 then do;
IF death = . then c2dthyr = 12 ;
ELSE c2dthyr = YEAR(death) - 1980 ;
yrslived = c2dthyr ;
dthy2 = YEAR(death) ;
end ;


There were two cohorts, one for which data collection began in 1971 and one in 1980. I wanted the option to analyze two different variables (and possibly split the data set later), so I created two variables named c1dthyr and c2dthyr. I also wanted one variable so I could compare survival rates by cohort, so I created a variable named yrslived. I was working with a student who was interested in deaths in a specific range of years, so I created variables dthy1 and dthy2 for her. The YEAR function returns the year part of a SAS date. A DO group for each cohort took care of all of that.


DROP  b bd F_A_X01 -- I_AUTPSY MR1_CS1 -- A_F_DT1 D_A_DT2 -- H_DISDT4 MRCSDT1 -- MR3_DT4 IREVMVA -- NTIS5 LABsum1 -- E_MRDX Vis1y1 -- vis9y9  E_CR1 -- PRCS43  B_INTR -- MRCSDDT23 ;

The last statement drops a bunch of variables I don't need. The somevar -- othervar notation, with the two dashes in between, includes all of the variables in the dataset, in order, from the first variable mentioned to the second. There were literally hundreds of variables I wanted to drop, so here they are all dropped. Now I have the file I want, and I am happy and ready to start running stuff for my first day of class.

 

I am pissed.

As a small business owner, I am feeling very, very disappointed that there is certainly some law out there that would impose penalties if I drove on over to Riverside County and bitch-slapped Darrell Issa. I’ve grown cynical enough in my old age and after having run a small business since 1985 that I am used to every politician under the sun spouting “Think of small business!” as a knee-jerk reaction to anything, whether their position is for it or against it. Usually, they are easy enough to ignore. Payroll taxes are not going to decide whether or not I hire people – business demand is. Health care – we’ve made that an option for our employees long before it was required by law.

This time, though, they are REALLY pissing me off. Let me tell you what the Research Works Act is and how it really does hurt my small business. As this succinct article by Janet Stemwedel on the Scientific American blog site explains well, not only does it require the American taxpayer (that’s me!) to pay twice for the same research, but also, the very people being protected and profiting are NOT those who produce the work to begin with.

Right now, if a person is funded by federal funds, say, the National Institutes of Health, they are required to submit the results of their research to PubMed’s repository within twelve months of publication. The idea is that if the public paid for this research then the public has the right to read it. Sounds fair, right?

In case you don’t know, rarely do authors get paid by journals.  I’ve published articles in the American Journal on Mental Retardation, Research in Developmental Disabilities, Educational and Psychological Measurement – to name a few. I’ve been a peer reviewer for Family Relations. For none of this did I get paid. That was fine. Almost all of the research I did was funded by federal funds and part of the grant proposals included dissemination – that is, publication of scientific articles. Fair enough. As a peer reviewer, I’m just repaying the service others have done in reviewing my work. Again, no problem.

Yet, in many cases, if I need a journal article for a grant or report I’m writing for a client, it is going to cost me $30 per article.  Contract research is a good bit of where the actual money comes into this company (you didn’t seriously think I made my living by drinking Chardonnay, spouting wanton programming advice and snarky comments, did you?)

The journal did not pay for the research to be conducted – the federal government did. The journal did not pay the author – the federal government did. The journal did not pay the reviewer – they volunteer.

SO WHY THE FUCK SHOULD I PAY $30 AN ARTICLE?

I just pulled up a random small project I had done recently for a client and there were seven articles in there that I would have been charged $210 to use. As I said, this was a small project, and I calculated that would have brought the price up by 7.5%.

Not long ago, there was a huge outcry about the city of Santa Monica adding a ten cent cost for a paper bag and banning plastic bags. “It will hurt small business!” people cried. Actually, I have always made a major effort to shop at our local businesses, I still do and it has not hurt any business anywhere that I can tell. You know what else I can tell you? That increasing the cost of a $3,000 project to $3,210 is a hell of a lot more significant than paying ten cents for a paper bag!

So what exactly is this bill doing? It is moving money from small businesses like mine, and like that of my buddy Dr. Jacob Flores, who runs Mobile Medicine Outreach, into the hands of large publishing companies, who not coincidentally gave a huge amount of money to Democratic congresswoman Carolyn Maloney of New York. I'd like to bitch-slap her, too, but being on the opposite side of the country, it would be a lot less convenient for me.

You can read more detail in this article from the Atlantic, where Rebecca Rosen asks, “Why Is Open-Internet Champion Darrell Issa Supporting an Attack on Open Science? ”

As Danah Boyd points out on her blog, there is this new thing for sharing knowledge now, called the Internet and a major point of the Research Works Act seems to be to prevent it being used to share knowledge that I paid for with the approximately 50%  of my income (yeah 38% federal, 10% state) that I pay in taxes. And you know what, being a graduate from that great institution, the University of California, that enables me to make the money to pay these taxes, I don’t object to that.

What I DO object to is paying again for the same resources I already paid for once just because some lobbyists for large corporations lined Issa’s and Maloney’s pockets.

While it may not be legal for me to bitch-slap Issa it is certainly legal for me to go to the next California event where that lying-ass mother has the balls to stand up and claim to be helping California small business. Anyone who knows the next public event where he’ll be speaking, please hit me up.

One thing, though, I don’t think I’ll be going to any of his fundraisers. I think he’s gotten quite enough money from the publishing industry.

 

I’ve spent about 35 years messing around with computers based on the assumption that most discoveries are not preceded by “Eureka!” but rather,

“What the hell! May as well try it.”

With my new computer having pretty much dissolved in smoke (less than a month after I bought it!), I decided to continue my analyses of the Kaiser-Permanente data using SAS On-Demand. I had downloaded the data as a .stc file a while back and used PROC CIMPORT to read it in, with all the formats, which, as I have mentioned before, is really, really easy to do.

I wondered what would happen if I uploaded the SAS dataset, you know, the .sas7bdat file  and the FORMATS catalog to the SAS On-demand server. Would it work?

Short answer – kind of. SAS On-Demand is very forgiving when it comes to format errors. It appears to have the NOFMTERR option turned on by default. I'd leave it alone. Look what happens when I do a simple table analysis. I get a table – sort of.

This table doesn't use the formats. No one reading it would know that 1 means “6 or more drinks a day”. If it were just one format, I could use a PROC FORMAT and re-create it. No big deal. But there are hundreds of variables and thousands of lines of format code I'd have to write. My first thought of

“What the hell, I’ll just drag and drop the folder with the formats”

was a failure. In fact, although Filezilla shows that my format folder was uploaded, it doesn't actually show up on the SAS server at all.

[Table: alcohol consumption with no formats]

I suppose I could at this point go read some documentation or SAS user group papers – there is plenty out there – on moving catalogs between systems. Seriously, where is the fun in that?
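For anyone who does want to do it properly, the usual portable route – which I did not try here, so consider this a sketch with made-up paths – is to dump the catalog to an ordinary data set with CNTLOUT, move that data set like any other file, and rebuild the catalog with CNTLIN:

```sas
/* On the original system: write every format definition in the
   catalog out as a plain data set. Libref and path are hypothetical. */
LIBNAME oldfmts "/path/on/original/system" ;

PROC FORMAT LIBRARY = oldfmts CNTLOUT = work.fmtdata ;
RUN ;

/* Transfer fmtdata to the new system like any other SAS data set,
   then rebuild the formats from it there */
PROC FORMAT CNTLIN = work.fmtdata ;
RUN ;
```

Because the formats travel as a data set rather than a catalog, the operating-system and encoding headaches mostly go away.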

I still had the original .stc file lying around so I thought I would see if I could use PROC CIMPORT with SAS On demand.

I had read-only access to the SAS server because that is what the students have, and I wanted to replicate what a student could or could not do.

I uploaded the .stc file using Filezilla (nothing special about Filezilla, I just happen to like it).

When I looked in the SAS library for my class, I did not see the file, even though all of the regular .sas7bdat files showed up there, but I thought, what the hell, I'll try the CIMPORT anyway.

Then, I ran this code:

libname mydata "/courses/univ.edu1/i_432400/a_2469/saslib" access=readonly;
Filename readit "/courses/univ.edu1/i_432400/a_2469/saslib/04219-0001-data.stc" ;
proc cimport infile = readit library = work  ;

 

The part in quotes in the LIBNAME statement is the directory for my course on the SAS server.

I got this error message:

ERROR: This transport file was created by an earlier release of SAS that did not record encoding. If and only if you know this file was created from a SAS session running a UTF8 encoding, you may set the ISFILEUTF8 CIMPORT option to T (for TRUE) to tell us to proceed to import this file. If it was not created by a UTF8 session, then you must import it in a session encoding that is compatible with the session encoding that produced that file.

If and only if I know this file was created running a UTF8 encoding? Are you kidding me? Despite all the best warnings from my mother not to touch things if you don't know where they've been, I downloaded this off the Internet! It could have originally been encoded in Korean for all I know (it wouldn't be the first time).

But then I said to myself,

“Self, wait a minute here. If it is in fact UTF8 encoded and I say that is true, then it will work. And if it isn't true, what's the worst thing that could happen? A gnome with a candy-cane hat jumps out of my computer and slaps me? Not very damn likely.”

So, I added the option and ran the statement like this:

proc cimport infile = readit library = work ISFILEUTF8=TRUE ;

My program ran, my SAS log showed all the formats were output, and when I ran analyses, my formats were there and used.

All was right with the world.

Just to make matters more complicated, let me throw in this  — if you had NOT made the library ACCESS = READONLY and tried to write to your class’s SAS library you would get a DIFFERENT error about the formats coming from a different system and it would not have worked.

My guess, and it is just a random guess, is that maybe the class library had formats that were inconsistent with the older ones being uploaded, so there was a conflict, while, since the work library is just held in working memory, there was nothing there to be inconsistent with. I could look it up or Google it or something. But I didn't.

Yes, it IS amazing that they give people with my level of maturity passwords to high performance computers. I find it hard to believe myself.

 

 

 

 

When I met my second husband, the entire contents of his kitchen cupboards were a carton of Marlboro cigarettes, a bottle of Jack Daniels and a loaded Magnum revolver. Despite that, he was extremely healthy. His reasoning was that he was a big, tough guy while viruses and bacteria were little bitty things, so the nicotine and alcohol killed them before they could get around to making him sick.

In my continuing saga of mucking about with open data, I’ve been looking at the Kaiser Permanente data from the study of the oldest old. I’m interested in testing out the claims of a friend of mine who recently turned 65. He says that he may as well keep up drinking 8 or 9 beers a night because it’s too late to quit now. He’s already too old to die young.

The original data set had over 40% of the data missing on this variable but I noticed that people who said they did not drink were not asked this question. So, I added a couple of IF statements that set the number of drinks per day to zero for people who said they did not drink. This left me with four categories.

A nifty little note if you are using SAS Enterprise Guide: if you want your data to be presented in the table according to the formatted values, rather than the unformatted values, look to the right of the Summary Tables task window and you'll see an option for what to sort by and in what order.

This was useful for me because the default is to sort by unformatted values and the unformatted values were

  • 0 = never,
  • 1 = 6 or more,
  • 2 = 3-5 drinks per day and
  • 3 = 2 or fewer drinks per day.

The results can be seen in the table below. In fact, people who did not drink at all, at least in the past year, had the longest life span, at 85.4 years. People who drank two or fewer drinks per day had a life-expectancy of 84.8, those who had 3-5 alcoholic drinks per day lived an average of 81.9 years and those who drank six or more did have the lowest life span, a still pretty old 81.7 years.

It’s worth noting that almost 20% of respondents had no data for this question, and they had the longest lifespan of all, at 85.8 years. I don’t know what to make of that, other than that possibly those who tell researchers it is none of their damned business how much they drink have less repressed anger and stress, and consequently live longer.

Table showing average lifespan by number of drinks per day

 

Were these differences significant? My next analysis was an ANOVA with lifespan as the dependent variable and the four alcohol consumption groups as the independent variable. I dropped the people with missing data.

There was a significant difference (F = 19.55, p < .0001); however, keep in mind that this is a fairly large sample, meaning that even relatively minor differences will be statistically significant. In fact, drinking only explained about 2% of the variance in lifespan, and that’s assuming there aren’t any confounding factors, like women drinking less and also having greater longevity.
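For anyone curious what is under the hood of that F-test, here is a minimal pure-Python sketch of a one-way ANOVA on made-up numbers (the real analysis was done point-and-click in SAS Enterprise Guide, and these are not the Kaiser data):

```python
# One-way ANOVA computed by hand on invented lifespan numbers.
# F = MS_between / MS_within, and eta-squared (SS_between / SS_total)
# is the proportion of variance explained; it was about 2% in the real
# analysis but is much larger in this tiny toy example.

def one_way_anova(groups):
    all_vals = [x for g in groups for x in g]
    grand_mean = sum(all_vals) / len(all_vals)
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / (ss_between + ss_within)
    return f, eta_sq

# four drinking groups, invented lifespans
groups = [[85, 86, 84], [84, 85, 85], [82, 81, 83], [81, 82, 81]]
f, eta_sq = one_way_anova(groups)
```

The large-sample point in the text is visible in the formula: F depends on group sizes, so with thousands of subjects even a tiny eta-squared will push F past the significance cutoff.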

Another nifty tip for SAS Enterprise Guide: if you would like to export a graph in a project, say, so that you can post it on your blog, all you need to do is right-click on the graph and select SAVE PICTURE AS.

When we look at the box plot produced by the ANOVA, it’s evident that there is very little difference between the last two groups.

We knew that from the table of means above, anyway. Perhaps more interesting is the fact that a Tukey post hoc test showed all groups to be significantly different from one another EXCEPT for the last two groups.

In other words, for my friend who drinks 8 or 9 beers a night to see any increase in life expectancy, he would have to cut his alcohol consumption to less than 25% of its current level. Simply cutting it in half would not yield him a significant benefit.

Of course, I do suggest that from time to time and his response is:

Look, you’re not my mother. If I have any need of a woman telling me what to do, I already get it fulfilled when I visit my mother three times every week. She’s  88 years old.

I wonder how much she drinks?

 

(Here is one place to check if you are seriously asking yourself,  “Do I drink too much?” because I can pretty much guarantee you that a blog fueled by Chardonnay is probably not the best place to answer that question.)


(Yes, that title does sound like a lot of the spam comments I get. )

Last year, at the Gov. 2.0 conference in Santa Monica, Jean Holm from NASA spoke about some of the opportunities for open data. I left with mixed feelings. On the one hand, the best examples she gave were, I thought, of “semi-open” data, a term I just made up for more openness of data within an organization. In one example, there would be a database of the capabilities of researchers within NASA, so if I were a NASA electrical engineer and I had an idea for designing a better electrical system for a lunar module, I could find out who had related expertise in hardware design, reliability testing, etc. That is a great idea, but it also makes me wonder to what extent open data within an organization would be put to use. It depends a lot on the organization. Many large institutions – whether corporate, government, non-profit or educational – are not very excited about people going around the usual chain of command or departmental structure, no matter how many times they chant, “Think outside the box!”

Many times, both on this blog and elsewhere, I’ve questioned the probability of useful discoveries coming from open data unless the individuals doing the analysis have some knowledge of statistics, programming and the structure of the data actually being used. If ten thousand people each do 100 analyses, that is 1,000,000 analyses, and at the .05 level we would expect about 50,000 of them to come up statistically significant by chance alone. How do we decide which of the significant results are real findings and which are part of those 50,000?

Ten thousand people doing 100 analyses = 1,000,000 results. One of those will have a p-value of .000001 just by chance. A one in a million coincidence happens one time out of a million, right? So, let’s say we get three of those p < .000001 events. How do we know which ones matter?
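The back-of-the-envelope arithmetic behind that worry:

```python
# Expected chance "findings" from mass-producing analyses: with n
# independent tests and no real effects at all, about alpha * n of them
# will come up significant anyway. Numbers are from the example in the
# text, not from any real data set.
n_tests = 10_000 * 100          # ten thousand people, 100 analyses each
alpha = 0.05
expected_false_positives = alpha * n_tests    # about 50,000 at p < .05
one_in_a_million = 1e-6 * n_tests             # about 1 hit at p < .000001
```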

 

She said, sure, you can have people correlating everything and coming up with nonsensical relationships, like one between the number of flashlights sold and solar flare activity. Presumably, somewhere “out there” are scientists or consumers of data or someone who will be able to separate the real findings from the flashlight-sales/solar-flares relationships. Having read a lot of academic journal articles that appear to have been both written and edited by people who were either not paying attention, inebriated or both, I am not so optimistic.

So, I decided to do an experiment and see just how far I could get with some samples of open data. The first data set I chose was from the Kaiser Permanente study of the oldest old. This is actually two data sets.

One of the reasons I chose these data is that they come with pretty comprehensive documentation. For example, after reading through several hundred pages, I knew that the first data set was the master file and the second was a hospitalization file. In my experiment to see whether I could find anything useful in here at all (other than what had already been published), I decided to use just the master file.

A second reason for selecting the oldest old study is that there are some published statistics I can verify my results against to see whether I have the data read in and coded correctly from the beginning. For example, the number of deaths I had in the first cohort (1,565) and second cohort (1,751) matched their figures exactly.

I did not start out with any preconceived notions other than those of the general public, assumptions like: the older people were at the beginning of the study, the more likely they were to die before it was over. While I’ve worked on some health statistics studies in the past, I am not a physician. This is one reason I used the master file instead of the hospitalization one. I know what an acute myocardial infarction is, but I could not really generate much in the way of hypotheses about it, nor about the accuracy of diagnosis. On the other hand, dead or not dead is pretty objectively measured.

I started with cigar smoking because I have a friend who turned 65 last month. His annual physical went like this:

Doctor: Do you still smoke cigars?

Jim: Yep.

Doctor: Do you still drink 8 or 9 beers every night?

Jim: Yep.

Doctor: Are you going to quit?

Jim: Nope.

Doctor: Well, see you next year.

You know how old people always tell you that it doesn’t matter if they quit now, because they’re too old to die young anyway? I got to wondering if that was true, if people who were smoking and still alive at age 65 would be any more likely to die. (We’ll save the 8 or 9 beers thing for later.)

My first analysis was to look at the data by gender, because I suspected that men were more likely to smoke cigars and I knew that women have a longer life expectancy. When I did a table analysis by gender and cigar smoking using SAS Enterprise Guide, I found that only one of the 288 cigar smokers in the sample was female. So, it was clear that if I was looking at the impact of cigar smoking on mortality, I should probably look only at men.

The average age at death for men who smoked cigars (N=175) was 84.2 years, compared to 84.2 years for men who did not smoke cigars (N=1,147). Since there was some rounding here, p does not equal exactly 1.0 but rather p > .90.  So, it does not seem that quitting cigar smoking would improve your life expectancy given you have already made it to age 65.

However, that is only people who died. Perhaps not smoking will increase your probability of survival? Let’s take a simple chi-square analysis comparing people who smoked cigars to those who did not, and see what the probability is that they died over the nine years they were followed.

Of the cigar smokers, 13.2% died during the nine years participants were followed, compared to 13.1% of men who did not smoke cigars (chi-square = .07, p > .78).
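A 2x2 chi-square like that is simple enough to compute by hand. Here is a pure-Python sketch; the cell counts are invented to roughly match the reported percentages and are not the actual Kaiser table:

```python
# Pearson chi-square for a 2x2 table, computed by hand. Counts are
# made up to roughly match the reported figures (about 13.2% vs 13.1%
# dying over nine years), not taken from the real data.

def chi_square_2x2(table):
    """table = [[a, b], [c, d]]; returns the chi-square statistic."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# rows: cigar smoker yes / no; columns: died yes / no
table = [[38, 250], [150, 997]]
stat = chi_square_2x2(table)
```

With proportions this close, the statistic comes out near zero, which is why the p-value is nowhere near significance.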

Still, I am not convinced. Maybe cigar smokers were older, or less likely to be overweight or drink (well, I doubt that one). Anyway, being the good little statistician, I ran a logistic regression with death as the dependent variable and BMI, education, age at the beginning of the study (baseline), whether or not the subject had ever drunk alcohol and whether or not he smoked cigars as the predictors.

Table of Type 3 effects

As you can see, the only two significant factors were age (older people were more likely to die in the next nine years – not surprising, since the study began with people 65 and over) and education. I’m guessing education is a proxy for social class.
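For the curious, the bones of such a model can be sketched with a bare-bones logistic regression fit by gradient descent. This is a toy illustration on invented data with a single hypothetical predictor (standardized baseline age), not the actual analysis, which was run in SAS:

```python
# Minimal logistic regression by stochastic gradient descent, mirroring
# the structure of the model in the text: death (0/1) predicted from
# covariates. Data and the single predictor are invented.
import math

def fit_logistic(X, y, lr=0.1, epochs=2000):
    w = [0.0] * (len(X[0]) + 1)             # intercept + one weight per predictor
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(death)
            grad = p - yi                   # dLoss/dz for log-loss
            w[0] -= lr * grad
            for j, xj in enumerate(xi):
                w[j + 1] -= lr * grad * xj
    return w

def predict(w, xi):
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

# toy data: standardized "baseline age"; older subjects died more often
X = [[-1.5], [-1.0], [-0.5], [0.0], [0.5], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w = fit_logistic(X, y)
```

The positive weight on the age predictor corresponds to the finding above: older at baseline, more likely to die during follow-up.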

So, it looks as if Jim is okay to keep smoking his cigars, given he has made it this far. I’m still not convinced about the 8 or 9 beers, though. That’s my next analysis.

 
