I have way more pictures of buffalo than the normal person.
I guess that comes from working for 19 years on the reservation. After reading on GigaOm about the $28 billion in rural broadband access, I decided I wanted to write a post on Internet access in rural communities, specifically, the data we have on the Great Plains reservations. To illustrate it, I thought I would use a picture of a buffalo, which is the point where I realized that I have pictures of buffalo in the summer, in the winter, in groups, all alone and wearing a tutu. But I digress. Actually, this whole blog is a digression.
We collected these data over a period of several years ending in 2008 when funding agencies quit paying us to do this and started paying us to do something else. Actually, by that point we were also coming to the conclusion that access to the Internet on the reservations was becoming less of an issue in terms of the hardware and more of an issue in terms of content.
Over the four years or so of data collection we surveyed 701 people on Indian reservations in the Great Plains states, mostly in North Dakota, but also including South Dakota, Montana and Minnesota, with a few odd visitors from other reservations and regions thrown in who happened to be attending pow-wows in the area and were kind enough to give us their time and opinions. To give you some idea of the rural nature of the reservation, the picture above was taken half a mile from the center of the largest town on the Spirit Lake Nation. The respondents were pretty representative of the general population on the reservation; 57% of them had a high school education or less.
Interestingly, we found that nearly half (49.8%) of people on the reservations we surveyed had a home computer, and almost as many (47%) had home internet access. I was curious to see if there were differences between those who did and did not have home internet access. There were, but they were not as large as I expected. Those without internet access had, on average, 11.9 years of education, or a little less than high school, while those who did have internet access had an average of 13.2 years of education. For those of you who follow these things, the difference was statistically significant (p < .0001) and there was also a significant difference in variance.
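For those who like to see the arithmetic: the post does not say which test produced that p-value, but when two groups differ in variance, one common choice is Welch's t-test. Here is a minimal Python sketch of it, assuming nothing beyond the standard library. The function name `welch_t` and the two little lists of years of education are made up for illustration and are NOT the survey data. It returns the t statistic and the Welch degrees of freedom and leaves the p-value lookup to your stats package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    # Squared standard error of each group mean (sample variance / n)
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Hypothetical years of education, invented for this example only
no_internet = [10, 11, 12, 12, 12, 13, 14]
has_internet = [12, 12, 13, 13, 14, 16, 16]

t, df = welch_t(no_internet, has_internet)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A negative t here just means the first group has the lower mean, which is the direction the survey found.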
As you can see from the graph, people who did not have internet access were far more likely to be high school graduates or less, while the people who had at least some college, even just a year or two, were far more likely to have internet access at home.
Near the end of the project, broadband access became available on the reservations. So, we began asking how many people had broadband access. We also asked how many people knew what broadband was. Since this was near the end of the project, we only asked about 10% of the people we surveyed. Guess what? By the end of 2007, about half of the people we surveyed on the reservation knew what broadband access was, and those 50% reported having broadband access. We did not ask if they had it at work or at home, which, in retrospect, might have been a good question.
So, what difference does it all make and why is your government spending $28 billion on it?
Well, that is a very good question, but since it is past 11 pm and I have a glass of Chardonnay waiting for me, it will have to wait until tomorrow (the answer, not the Chardonnay, which I am going to attend to right now).
The NIH stimulus grants were supposed to be announced in August. Many people I know have already been turned down, no surprise since there were 10,000 applications for 200 grants. I checked on grants.gov last night and our proposal has been assigned a panel with a review date of 10/2009 which is very weird since I thought the grants were to be announced in August. I am just happy we are still in the running. We may have been shifted over to some other competition. I know some of the institutes had extra funding for Comparative Effectiveness Research (which ours was not) and other specific interests of the particular institute. This was the most interesting, exciting research I had proposed in a long time. For now, I am not committing to any new projects just in case we get funded. In any event, it was a ton of fun writing it, and how often can you say that about a grant proposal?
Bouncing back and forth between SAS, SAS EG, SPSS and a very tiny bit of Stata. Things I am learning that I like and don’t like:
SPSS doesn’t do the Cochran-Armitage test for trend. I mean, of course you can program it, but it is not an option, whereas in SAS it is incredibly simple:
* TREND requests the Cochran-Armitage test; the table must be 2 x k or k x 2;
proc freq data = datasetname ;
table variable1*variable2 / trend norow nocol nopercent scores=table;
run;
I swiped this code out of a short and sweet paper from PharmaSUG.
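If you are stuck somewhere without the TREND option and really do have to program it yourself, here is a rough Python sketch of the Cochran-Armitage test for a 2 x k table, using the usual large-sample normal approximation. The function name `cochran_armitage` and its arguments are my own invention for illustration, the default scores of 0, 1, 2, ... are an assumption (equally spaced scores), and the sign of z depends on which response you count as the "event", so it may be flipped relative to what your software prints.

```python
from math import erfc, sqrt


def cochran_armitage(events, totals, scores=None):
    """Cochran-Armitage trend test for a 2 x k table.

    events[i] -- number of "yes" responses in ordered group i
    totals[i] -- number of people in ordered group i
    scores    -- numeric score per group (assumed 0, 1, 2, ... if omitted)
    Returns (z, two_sided_p) from the normal approximation.
    """
    if scores is None:
        scores = list(range(len(events)))
    n = sum(totals)
    p = sum(events) / n  # overall proportion of "yes"

    # Score-weighted sum of observed minus expected events
    t_stat = sum(s * (e - m * p) for s, e, m in zip(scores, events, totals))

    # Variance of that sum under the null hypothesis of no trend
    sum_s = sum(m * s for s, m in zip(scores, totals))
    sum_s2 = sum(m * s * s for s, m in zip(scores, totals))
    var = p * (1 - p) * (sum_s2 - sum_s ** 2 / n)

    z = t_stat / sqrt(var)
    return z, erfc(abs(z) / sqrt(2))  # two-sided normal p-value


# Hypothetical counts: "yes" answers rising across three ordered groups
z, p_value = cochran_armitage(events=[10, 20, 30], totals=[100, 100, 100])
print(f"z = {z:.3f}, two-sided p = {p_value:.5f}")
```

It is a dozen lines of arithmetic, which is exactly why it is annoying when it is not simply a checkbox.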
SAS Enterprise Guide is good for data analysis and graphs, but for anything that uses functions, merges files or sorts files it is a little clunkier. While it may work fine for lots of people (the Excel and Access users of the world, of which there are approximately ten jillion), I do think some things still require code or are easier with it. Recoding lots of variables using arrays, using functions like the input function and many more things are much easier when you just write SAS code.
Enterprise Guide does have some benefits to recommend it, though. In addition to the graphics, there are the much easier ways of doing summary tables (yes, I can type out proc tabulate code no problem but that is not true of 99.9% of the general public). I like how it is much easier to change data attributes from character to numeric and vice-versa. SPSS has had that for a long time. It used to be a bit of a pain in SAS and with EG there is nothing to it.
In short, though, computers are wonderful. I wrote a SAS program to pull all the paid-up users out of our database and am sending them all an email on how to renew their license with the setinit attached. I wrote a second program to pull out all those who haven’t paid and email them a reminder. Since my computer did all the work, I am going to reward myself for a job well done by heading over to the library and reading through the new Stata manuals to see if there is anything interesting on survival analysis. After that, I think I will see what is in these new SPSS modules that I received a month or so ago and still have not had time to examine – complex samples, exact tests and other goodies.
What more could a person ask out of life? (Well, except for getting that grant funded.)
“I try to take it one day at a time, but lately several days have attacked me at once.”
If Erma Bombeck wasn’t really the first to say that, she could have been. I shouldn’t complain. Part of the reason my days are overly full is that I have one more child than Erma did and the same number of husbands. (Okay, well, I admit that I had more husbands TOTAL than Erma did but I had the same number at a time: one.) Still, they are good kids and a good husband, even if I did have to practically dig out the house with a shovel so we could get the new carpets put in last week.
Well, my life with SPSS and SAS has been that way lately. Glad I have this stuff but sometimes I ask myself,
“Is it really necessary to be this much work?”
Who was it that said,
“You should have to figure out a mystery novel. Code should be read.”
That describes my view on programming quite well and documentation, too. Lately there has been too much of what I call “Scavenger Hunt Documentation”.
Do you want to know how to define a path for SPSS on a Macintosh? I wonder if this changed at some point because I found several websites that had it wrong. For example, I logged into my computer as annmaria and I want to open a file on my desktop in a folder named examples.
GET FILE = '/Users/annmaria/Desktop/examples/mystuff.sav'.
Let’s say the file is on a flash drive named something like “Kingston” because I did not bother to rename it. Then the statement would be:
GET FILE = '/Volumes/Kingston/mystuff.sav'.
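The pattern in both statements is the same: a home folder lives under /Users/ followed by the login name, and a flash drive lives under /Volumes/ followed by the drive name. As a small illustration, here is a Python sketch that assembles the two paths from their pieces and wraps them in a GET FILE statement. The helper name `get_file_statement` is hypothetical, and the login, folder and drive names are just the ones from my example above.

```python
from pathlib import PurePosixPath


def get_file_statement(path):
    """Wrap a Mac-style path in an SPSS GET FILE statement (straight quotes)."""
    return f"GET FILE = '{path}'."


# A file in a folder on the desktop of the user logged in as annmaria
home_file = PurePosixPath("/Users") / "annmaria" / "Desktop" / "examples" / "mystuff.sav"

# The same file on a flash drive that was never renamed from "Kingston"
flash_file = PurePosixPath("/Volumes") / "Kingston" / "mystuff.sav"

print(get_file_statement(home_file))
print(get_file_statement(flash_file))
```

Note the plain straight quotes around the path. Curly quotes pasted from a web page are a classic way to get a syntax error.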
Of course, if I had a Mac with SPSS installed at the time, I could have just opened a file, looked in the output window and seen what was written. It just so happened that I was answering a person’s question from a computer that ran Linux and did not have SPSS on it. (Yes, I know there is an SPSS for Linux, but I only have the Mac and Windows versions and at that moment every Mac in the house was occupied by a child, spouse, child’s friend or visiting niece. I know, I am deprived). So, I ended up looking all over the web and documentation to find the simplest piece of information. I feel immeasurable guilt because I cannot remember the url of the helpful person’s page where I found this.
This is my next project on my infinite to-do list. I am going to start a 20 basic questions page for SPSS, SAS and SAS Enterprise Guide and then a 20 totally random questions page just because I think it would be kind of fun. I remember a time when writing code was the only way to use SPSS and a GET File statement was the most basic beginning – but now it kind of falls under the heading of random and you have to dig through the drawer with the egg separator, the garlic brush and the beaters from the cake mixer you threw away last fall to find it.
And SAS – the install tragedy continues…
Today, I wanted to create two single-layer DVDs to install the 64-bit version of SAS. I could not remember how to create DVDs because I usually install via download from a server or off an external 160GB drive. We had created some DVDs several weeks ago, but I couldn’t remember how, having done approximately 246,173 other things in the intervening weeks. I popped in the SAS Software Depot and the three options were Install SAS software (nope), Manage Software Depot (you can view and remove SAS software orders with this option) and Create a New software depot (nope).
I did find my answer in a page on the support.sas.com website, prompted by a vague memory. Sure enough, to create DVDs you ignore what it says Manage Software Depot does, because someone apparently forgot to add “and create media” to the description of that option. If you click on it anyway based on the assumption of “What the hell”, this last phrase describing a good bit of my life, by the way, you’ll be happy and pleased to find that it actually does break the depot into separate folders, each of which will fit on one single-layer DVD.
I have concluded that my problem is that I am in the 20. You know that 80-20 principle, that 80% of your usage can be accounted for by your most common 20% of the features? Well, I am in that other 20% that uses everything, the Date and Time Wizard in SPSS (cool thing, by the way), the Surveymeans procedure in SAS and the histogram plots with a normal curve superimposed on the Distributional Analysis task in SAS Enterprise Guide. At least once during the month I will use Ubuntu, Mac, Windows XP, Solaris, Vista 32 and Vista 64. I guess, like the houseful of family, that is mostly good. Unfortunately, it means that the first 80 hits on Google are never what I want. It is like someone once said to me at a party,
“If I call a company’s tech support and it is obvious that it has been outsourced to some other country, I hang up because all the guy there is going to do is to tell me whatever he read in the manual. If I call on Thursday, he is going to say, ‘Did you try A, B & C’ and I will have to say, ‘Yeah, on Tuesday. I also tried D through Z, and then I called you.’ There oughta be a manual for those who already read the manual.”
I wonder if such a thing exists? IBM used to have something like that years ago for their mainframes. I know one thing, if there is such a thing and it’s in my house you’ll probably find it in a corner of the closet under a pile of unused blankets and dirty laundry. We lost my youngest brother that way once. You think I’m kidding but I’m not.
“Is there anything you can do to help? I’d kill you but there is a law against it. You’d better leave before I figure out a way around that.”
This comment was made by a co-worker of mine who had saved all of the data for his thesis for a masters in computer science on his hard drive. Someone who needed assistance had stopped in his office, popped in a floppy disk and accidentally formatted the hard drive instead of the floppy. I tell this story just to point out that people screwing with your data is a phenomenon that dates back to at least floppy disks, which, if you ask my children, is equivalent to prehistoric.
Why You Need to Look at Your Data Seven Different Ways before you do ANY Statistical Analyses.
- The data were entered by clerks making minimum wage who hate that they are doing a job that, were it not for animal cruelty laws, would be done by a half-trained monkey.
- The data were entered by really bright undergraduates at a prestigious university who smoked something really good before coming in to work. (Are they still called joints? Email me if you know the answer.)
- After you taught all day, graded papers, read the RFP for your next grant, you entered all the data yourself – and finished both data entry and your third martini at 2 a.m.
So… you have your data entered into SAS Enterprise Guide. Congratulations. The very first thing you should do is go to the Tasks menu, select Describe and then select the List Data option. If you have a small dataset, you may want to list the whole thing. Otherwise, click on the Options tab. In the window to the right, in the drop-down box under Rows to list, select ‘Every nth row’, giving a value for n, say 10. This is what statisticians refer to as a systematic random sample and what other people, who do not invite us to their parties, refer to as every tenth row.
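If your data are sitting somewhere other than Enterprise Guide, every nth row is about as simple as sampling gets. A minimal Python sketch follows, with a hypothetical helper named `every_nth`. I am assuming the listing starts with the first row, which I have not verified is how the List Data task counts; the `start` argument lets you begin elsewhere.

```python
def every_nth(rows, n=10, start=0):
    """Return every nth row of a list, beginning at index `start`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return rows[start::n]


# Hypothetical dataset: 100 rows identified only by their row number
rows = list(range(1, 101))

print(every_nth(rows, n=10))
```

To make it properly random, a statistician would pick the starting row at random from the first n rows, for example with `random.randrange(n)` as the `start` value.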
The output is very plain vanilla, as you can see. You could make it prettier, but why? I do like the fact that SAS EG lets me output it as an html file so it can be uploaded easily and read by anyone. Because I do a lot of work as a telecommuter, this makes my life easier. Unlike most of what makes my life easier – the housekeeper, the detail car wash guy, Safari Books – the html output feature doesn’t charge me. So, props to it.
Go here for more step-by-step on how to use List Data. This is my personal university web page.
(I can link from here to there but not vice versa because some people are concerned about a rumor that this blog is written without supervision by the university attorneys, or in fact, by a responsible adult of any profession. This rumor is true.)
Next awesome innovation, go to Tasks again, then Describe then Characterize Data. This task reminds me of the first grader who wrote in his book report, “This book taught me more about penguins than I wanted to know.”
The characterize data task may tell you more about your data than you want to know if you just go with the default options, so I wouldn’t. I’d recommend unchecking the boxes next to Graphs and also the one next to SAS Datasets that produce the datasets containing Univariate statistics and frequencies. You may need those datasets or charts for every variable, but usually you don’t. It just slows down your job and produces a bunch of output you aren’t going to look at, especially if you have dozens or hundreds of variables. You may want to look at graphs for some selected variables later.
By default, the characterize data task will give you frequency distributions for categorical variables with 30 or fewer categories, and, for other categorical variables, the frequencies of the 30 most common categories. You can change the default from 30, if you would like. It will also produce descriptive statistics for all numeric variables, as well as the number of missing values. Again, you can make your output prettier than my output shown here, with titles, footnotes and probably embedded images of bells and whistles, but since the purpose of this page is to check for out of range values, outliers, etc., why bother, unless you are really, really bored.
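To make the idea concrete, here is a rough Python sketch of the same kind of reality check: frequencies of the most common categories for non-numeric columns, plus n, mean, standard deviation, minimum, maximum and a missing count for numeric ones. This is only an imitation of the idea, not how Enterprise Guide does it. The function name `characterize`, the use of None for missing values and the three toy rows are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, stdev


def characterize(rows, max_categories=30):
    """Summarize each column of a list of dicts; None marks a missing value."""
    report = {}
    for col in rows[0].keys():
        values = [row[col] for row in rows]
        present = [v for v in values if v is not None]
        n_missing = len(values) - len(present)
        if present and all(isinstance(v, (int, float)) for v in present):
            # Numeric column: descriptive statistics for range and outlier checks
            report[col] = {
                "n": len(present),
                "missing": n_missing,
                "mean": mean(present),
                "sd": stdev(present) if len(present) > 1 else None,
                "min": min(present),
                "max": max(present),
            }
        else:
            # Categorical column: counts for the most common categories only
            report[col] = {
                "missing": n_missing,
                "top": Counter(present).most_common(max_categories),
            }
    return report


# Made-up rows, just enough to show both kinds of column
data = [
    {"age": 30, "state": "ND"},
    {"age": None, "state": "ND"},
    {"age": 40, "state": "SD"},
]

for column, summary in characterize(data).items():
    print(column, summary)
```

An age of 400 or a state of "NX" would jump out of a listing like this immediately, which is the whole point.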
In some cases, it may be of interest to see if you have a normal distribution because you really do expect one. In this case, go to Tasks again, select Describe then Distribution Analysis. If you select Normal under Distributions, you can enter the hypothesized mean (except it isn’t hypothesized at all since you just saw it in the previous task) and standard deviation, too, if you so desire. Click on Plots and then select Histogram to see a histogram of your data with a normal curve super-imposed.
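For the curious, the comparison that picture makes can be sketched in a few lines of Python: bin the data, then ask how many values a normal curve with the same mean and standard deviation would put in each bin. The function name `histogram_vs_normal`, the choice of 8 bins and the simulated scores are assumptions for illustration, not anything taken from Enterprise Guide or from my data.

```python
from statistics import NormalDist, mean, stdev


def histogram_vs_normal(values, bins=8):
    """Return (lower, upper, observed, expected) for each equal-width bin.

    Expected counts come from a normal curve with the sample mean and SD.
    Requires at least two distinct values.
    """
    fitted = NormalDist(mean(values), stdev(values))
    low = min(values)
    width = (max(values) - low) / bins
    rows = []
    for i in range(bins):
        lower, upper = low + i * width, low + (i + 1) * width
        last = i == bins - 1  # the last bin also includes its upper edge
        observed = sum(1 for v in values if lower <= v and (v < upper or last))
        expected = len(values) * (fitted.cdf(upper) - fitted.cdf(lower))
        rows.append((lower, upper, observed, expected))
    return rows


# Simulated scores (NOT real data): 200 draws from a normal with mean 50, SD 10
scores = NormalDist(50, 10).samples(200, seed=1)

for lower, upper, observed, expected in histogram_vs_normal(scores):
    print(f"{lower:6.1f} to {upper:6.1f} | {'#' * observed} (normal curve: {expected:.1f})")
```

If the rows of # marks track the expected counts reasonably well, the bell shape is plausible. If one tail is much longer than the curve says it should be, go look at those rows before you model anything.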
You can also use the Titles options to enter titles and footnotes, since one should never miss the opportunity to suck up to the funding agency. If you want to change the output for some reason, say, you have a purple fixation, you can go to the Tools menu and select Options. Click Results, then HTML. You can select a different style for your output, then re-run the distribution analysis.
There. Purple. Are you happy now?
Actually, I am happy. The data look pretty good. Everything is pretty much in range, as shown in the descriptive statistics, not much missing data, the values and distributions on all the categorical variables are reasonable, the dependent variable is approximately normally distributed, so we are good to go on parametric models.
Reality check passed. For the data, that is. As far as those smoking, martini-drinking minimum-wage earning data entry people, the jury is still out.