I’m currently at a seminar in Washington, D.C., sponsored by the National Center for Education Statistics. I’ve seen the notices for these many times over the years and always thought, “Well, that looks interesting,” but never applied to go, primarily because although your expenses are paid, there is no stipend, so that is a week of consulting hours I’m giving up.

This latest one, though, was on the National Indian Education Study (NIES) and it was right up my alley. My former business partner and co-founder of Spirit Lake Consulting, Inc. (which I left to form The Julia Group) was also going.

So, what have I learned my first day? Well, that some of our assumptions were wrong. We had assumed, for example, that American Indian/ Alaska Native children were probably about equal to the total population average up through fourth grade or so and then the mathematics achievement gap would show up as students began attempting more complex problems.

At 9 months, according to the Early Childhood Longitudinal Study, Birth Cohort (which we learned about today), the two groups are about equal:

Chart showing equivalent performance at 9 months

However, the gap between American Indian/ Alaska Native children and the general population emerges far sooner than we had expected. By age two, even in early precursors to mathematics like matching and counting, we could see a difference:

Chart showing AI/ AN children with lower performance at age 2

We had thought that perhaps there would be some areas of mathematics where American Indian / Alaska Native children would be relatively closer to the national average. Well, I thought that; Erich didn’t, and he turned out to be correct.

As you can see here, the deficits for AI/ AN children in mathematics were both persistent and pervasive, extending across all domains measured and through the eighth grade, which was as far as we got today (hey, we’ve only been here a day, what did you expect?)

Chart showing deficits in all five domains for eighth graders

Actually, there was more that we learned today, a lot more, particularly about factors potentially explaining these achievement gaps. We came up with a whole list of positive and negative correlates and used the NCES web tools to test univariate relationships, but, of course, we are interested in much more than that.

Erich and I met for a couple of hours after the last session, drafting out a research design which we intend to apply to the NIES database that we expect to get our hands on tomorrow afternoon.

BUT ... it’s three hours later in D.C. and I have to be up early enough to drink at least three cups of coffee to stay awake in the 8:30 a.m. session in the morning. So, I’ll talk about that tomorrow.

I should add, though, that not only am I learning a lot but they are also very generous with the swag. Being a statistician, of course, my idea of swag is not the designer cosmetics they put into the Gucci bag, but rather things like a book, Status and Trends in American Indian/ Alaska Native statistics, and software they are going to give us tomorrow to analyze the NAEP / NIES.

The latter is very exciting because the sample is based on Item Response Theory and I thought I was going to have to use SAS or SPSS for multiple imputation, neither of which is cheap. (SPSS is far less expensive but multiple imputation is one of those add-on modules. As a faculty member, I qualify for the SPSS faculty pack, an awesome deal, and I have access to SAS on the university computers, but not all my colleagues are in the same position.)

NCES made this software sound like the greatest thing since sliced bread, so, I guess we’ll see if it stands up.

Speaking of Item Response Theory, that’s another good thing about this seminar. I grew jaded early in my career about the latest statistic being the answer to everything. It was factor analysis, then path analysis, then structural equation modeling, Rasch, IRT, MI and who knows what. Very often, an unneeded level of complexity was added just to show how smart and cutting edge the researcher was when a simpler design would have worked very nearly as well and been a whole lot easier to explain to the school board members who have to foot the bill, make policy and give us permission (or not) to come back and collect data next year.

All of that being said, the current study was the perfect situation to which Item Response Theory is the answer and Dr. Sikali did a brilliant job explaining exactly why.

I know we’re all busy teaching summer school, writing grants, publishing articles, conducting research and meeting with clients but if you see one of these NCES seminars that fits your research interests, I really recommend you go. I finally came because not one, but THREE of my friends who had attended previous seminars told me it was really worthwhile.

They were right. Your mileage may vary. But I doubt it.

Science is boring! Math is boring!

This is the whine of the world’s most spoiled 13-year-old as she does her homework, and I find it hard to argue with her because I have read her textbooks and all of them could be put to a better use as a cure for insomnia or starting a fire than actually teaching children.

Math and science teaching often IS boring, but it doesn’t have to be.

I just read Alan Alda’s book, Things I Overheard While Talking to Myself, in which he discusses making the act of carrying a glass of water 25 feet dramatic by announcing that if the student spills a single drop, his entire village will die. (No, the university does not actually give him the authority to wipe out a student’s home town, but nonetheless, it is the SET-UP that makes it interesting.)

If he can make carrying a glass of water interesting, why can’t we do it for statistics?

One of the reasons students are bored is that we tell them about statistics rather than showing them. One statement I don’t hear very often in statistics classes is,

“Let’s see what happens.”

For example, in my last post I gave an example of a chi-square with two variables, both with two options, yes or no. I said that

One group is much larger than the other - in this case, the students who said they did use a computer at home were 91% of the total sample,

AND there is a significant chi-square, showing a relationship,

AND you can look at the cell chi-square values and see that most of this chi-square value comes from the cells of the smaller group.

Do you think this will always happen?

How do we get a cell chi-square value? It is based on how much the observed number in a cell differs from the expected number.

Let’s say that 60% of all of the people we survey said “Yes” that they sometimes use a computer at public facilities like libraries. If there really is no relationship between having a home computer and public use, we EXPECT that 60% of the 6,600 people who have a home computer will say, yes they use public computers (that’s an expected number of about 4,000). We’ll also expect 60% of the 630 people who DON’T have a home computer to say that yes they use public computers (that’s an expected number of about 380, if you’re keeping track).

When we actually run the cross-tabulation and look at the cell chi-square values, we find that the people who do have a home computer are right around 60%, but the people who don’t have a home computer are far off.

At this point, I would ask students in the class what they thought would happen if we had equally sized groups. Very often, people will guess that the group that has the huge cell chi-square values will continue to be responsible for most of the total chi-square value, whether the two groups (people who do and don’t have a home computer) were equal in size or not.

Let’s test it and see what happens.

Re-read that last statement. To create drama in any situation, that’s a very useful thing to say.

Fortunately, SAS has a surveyselect procedure I can use to select a stratified random sample of 140 people from each group, those who do and those who don’t have home computers. I re-run the analysis and look at my new results. Lo and behold, I am correct!

The chi-square is still significant, but this time, instead of 90% of the chi-square value coming from the two cells for the people who don’t have a computer at home, the cell chi-square values are all about equal.

Why is that? Do you think this will always happen? Do you want to run it again with a different sample?

After running it a few more times, we can have a discussion on WHY the cell chi-square values are larger when you have an uneven distribution.

Yes, students memorize that you get a cell chi-square value by squaring (observed – expected) and dividing by the expected frequency.

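Where do you get that expected frequency? You get it from the whole sample, right? So if almost all of the sample is in one category, the expected percentage for every group is going to be very close to whatever the observed percentage is for that category. The big group can’t be far from what is expected, so the small group’s cells are where the difference shows up.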
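If you want to see what happens without SAS handy, here is a minimal sketch in Python (not the SAS code used for the post). The counts are ILLUSTRATIVE only: I built them from the rounded figures quoted in these posts (roughly 6,600 students with a home computer, 630 without, about 60% and 76% using a computer elsewhere), so they will not reproduce the TIMSS output exactly. The function names cell_chi_squares and column_shares are hypothetical helpers, not anything from a library.

```python
def cell_chi_squares(table):
    """Cell chi-square values and their total for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    cells = []
    for i, row in enumerate(table):
        cell_row = []
        for j, observed in enumerate(row):
            # Expected count comes from the margins of the WHOLE sample
            expected = row_totals[i] * col_totals[j] / n
            cell_row.append((observed - expected) ** 2 / expected)
        cells.append(cell_row)
    total = sum(sum(cell_row) for cell_row in cells)
    return cells, total


def column_shares(table):
    """Fraction of the total chi-square contributed by each column."""
    cells, total = cell_chi_squares(table)
    return [sum(col) / total for col in zip(*cells)]


# Rows: uses a computer elsewhere (yes, no)
# Columns: has a computer at home (yes, no)
# ILLUSTRATIVE counts only, not the actual TIMSS frequencies
unequal = [[3947, 479],
           [2653, 151]]
# What a sample of 140 per group might look like at about the same rates
equal = [[84, 106],
         [56, 34]]

for label, table in (("unequal", unequal), ("equal", equal)):
    cells, total = cell_chi_squares(table)
    shares = column_shares(table)
    print(f"{label}: chi-square = {total:.1f}, "
          f"share from home / no-home columns = {shares[0]:.0%} / {shares[1]:.0%}")
```

With the unequal counts, roughly 91% of the chi-square comes from the no-home-computer column; with 140 in each group, the two columns split it evenly. In a 2 x 2 table the gap between observed and expected is the same size in all four cells, so each column's share depends only on the column totals: the smaller the group, the smaller its expected counts and the bigger its cell chi-square values.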

Does this mean that the cell chi-square is a useless statistic? No. Sometimes it can be very useful.

Now, admit it, aren’t you at least just a little bit curious about what those times are?


Disclaimer: It is true that there are some students who are just not going to be interested in statistics unless the professor sets herself on fire, but for most students, just that little bit of set up, asking what the student thinks will happen, and that magic phrase, “Do you want to find out?” will spark students’ interest, at least for a minute.

IN CASE YOU ARE INTERESTED, THE CODE TO DO THIS IS BELOW


/* Create an html file and run the cross-tab with unequal distribution */

/* Request expected frequency, chi-square and cell chi-square statistics */


ods html file = "C:\TIMSS\sasout\chisq2.html" (title = "Cell-chi-square with UNEQUAL distributions") style = brick ;
Title "Cell Chi-Square and Expected Values with Unequal Distribution" ;
proc freq data = lib.student_int ;
tables Bs4GCels* BS4GCHOM / expected chisq cellchi2 ;
run;
ods html close ;

/* Sort the data by strata  */

proc sort data = lib.student_int ;
by BS4GCHOM ;
run ;


/* Select a stratified random sample of 140 in each stratum */

proc surveyselect data = lib.student_int out=computer method = srs n = 140;
strata BS4GCHOM ;
run ;

/* Re- run the analysis using the sample with equal Ns per stratum */


ods html file = "C:\TIMSS\sasout\chisqsample.html" (title = "Cell-chi-square with EQUAL distributions" )
style = brick ;
Title "Cell Chi-Square and Expected Values with EQUAL Distribution" ;
proc freq data = computer ;
tables BS4GCELS* BS4GCHOM / expected chisq cellchi2 ;
run ;
ods html close ;

When the first computer lab was put into the tribal college where I was a consultant, the professor in charge of the project complained that students were spending time in the lab on Yahoo, MySpace, emailing friends on other reservations, downloading software and all sorts of non-academic activities. He asked what he should do. I answered, “Let them, and count your blessings.”

That experience is consistent with the research showing that Native Americans make greater use of public internet facilities – principally school computer labs and library computers – than does the population at large.

Experiences like that one explain why I am a huge fan of public libraries. The rise of computer use for school work, with Wikipedia replacing the encyclopedia, has made me an even bigger advocate. I have always believed that kids who don’t have computers at home, by necessity, make more use of library computers – but can I prove it?

Since I had my hands on the 2007 Trends in International Mathematics and Science Study (TIMSS) data, I thought I’d take a look at the relationship between having a computer at home and use of a computer elsewhere. The nice thing about the TIMSS data is it specifically asked about school computer use, so the questions asked were:

Do you use a computer at home?

Do you use a computer at school?

Do you use a computer elsewhere (public library, internet cafe, friend’s house)?

It is true that this last question does combine a number of options (although I don’t think too many eighth-graders are going to Internet cafes). That’s what happens when you use data you didn’t collect yourself, but I decided to plow ahead anyway….

Below is the cross-tabulation of Use of computer at home by Use of computer elsewhere.


A couple of points I noted are:

  • 91% of students in the survey DO use a computer at home
  • Most students use a computer elsewhere whether they have a computer at home or not
  • More students who DON’T have a computer at home (76%) use a computer other places, like at the library, than students who DO have a computer (60%)
  • Another way to look at this is that students who DON’T have a computer at home are about three times as likely to use a computer elsewhere as not (odds of about 75-25). For students who DO have a computer at home, the odds are 60-40.
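For anyone who wants to check the odds in that last bullet, here is a minimal sketch in Python (not the SAS used for the analysis). It assumes only the two rounded percentages quoted above, 76% and 60%; the function name odds is just an illustrative helper.

```python
def odds(p):
    """Convert a proportion into odds in favor: p to (1 - p)."""
    return p / (1 - p)


# Rounded proportions quoted in the bullets above
p_no_home = 0.76   # no home computer, uses a computer elsewhere
p_home = 0.60      # has a home computer, uses a computer elsewhere

odds_no_home = odds(p_no_home)
odds_home = odds(p_home)
odds_ratio = odds_no_home / odds_home

print(f"No home computer: about {odds_no_home:.1f} to 1")
print(f"Home computer:    about {odds_home:.1f} to 1")
print(f"Odds ratio:       about {odds_ratio:.1f}")
```

So "three times as likely as not" is odds of a bit over 3 to 1, against 1.5 to 1 for students with a home computer, which works out to an odds ratio of roughly 2.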

So, given all of this, it seems like for that 9% of students who don’t have a computer at home, access to computers other places is relatively more important. Let’s look at this in terms of statistical significance. We can start by looking at a chi-square value, shown below.

 

So, we have a fairly whopping chi-square of over 66 with a p-value less than .0001. This tells us that there is a very significant relationship between having a computer at home (or not) and using a computer somewhere else. The Fisher’s exact test similarly rejects the null hypothesis. So, there is a relationship.

But that doesn’t really answer the question. The specific question I want answered is whether students who don’t have a computer at home use computers elsewhere more than students who do.

In this case, do I want to look at the cell chi-square values? That is, how much does each individual cell contribute to the overall relationship? Out of the chi-square value of 66, about 60 of it is due to the two cells under students who do NOT have a computer at home.

So does this tell me that I am right because the cell chi-square values for those in the “Do not have computer at home” column are so large?

Yes, and no, not at all. Yes, I am right. There is a relationship between not having a computer at home and using computers elsewhere more. The significant chi-square tells me that and the much higher odds of students who don’t have a home computer using one elsewhere tells me that also.

The cell chi-square does NOT tell me that. In fact, a cell chi-square value is comparing the obtained distribution between the cells to the expected distribution based on the population. In a case like this, where you have over 91% of the population in one column, the expectation is going to be driven by that 91%.

About 61% of the population uses computers outside home and school. If there is no relationship between having a computer at home and use of a computer elsewhere, you would expect to find about 61% of each group using a computer elsewhere. For the population who have a computer at home, the figure is 60% (okay, 59.8%), which is very close. But wait a minute! That 61% was based on a population composed overwhelmingly of that group, people with computers at home.
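To see how much that 61% is driven by the big group, here is a minimal sketch in Python (not SAS) that rebuilds the overall rate as a weighted average. It assumes only the rounded figures quoted in this post (91% with a home computer, 59.8% and 76% using a computer elsewhere); the 50/50 split is a hypothetical comparison, not something in the data.

```python
# Rounded figures quoted in the post
share_home = 0.91       # proportion of students with a computer at home
rate_home = 0.598       # of those, proportion using a computer elsewhere
rate_no_home = 0.76     # same proportion for students without a home computer

share_no_home = 1 - share_home

# The overall rate is an average of the two group rates,
# weighted by how big each group is
overall = share_home * rate_home + share_no_home * rate_no_home

# HYPOTHETICAL comparison: the same two rates with equally sized groups
balanced = 0.5 * rate_home + 0.5 * rate_no_home

print(f"Overall rate with a 91/9 split:  {overall:.1%}")
print(f"Overall rate with a 50/50 split: {balanced:.1%}")
```

With a 91/9 split the overall rate lands at about 61%, barely above the home-computer group's own 59.8%. With equal groups it would sit halfway between the two rates, at about 68%, so both groups would be equally far from expected.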

What if we took a stratified random sample where we had equal samples of students with and without computers at home? Would our cell chi-square values be about the same for the two groups? Yes, yes, they would. Check back tomorrow for proof.

This doesn’t mean cell chi-square values are useless or should not be interpreted, by the way. It just means that in some cases where you have a very unequal distribution, you can be misled if you are not careful.

The code for doing these analyses is below, by the way. The first statement creates the HTML page and uses the brown style, because I like brown.

The second statement invokes the SAS FREQ procedure.

The first TABLES statement just does the cross-tabulation between the two variables for using computers elsewhere and at home.

The second TABLES statement does a cross-tabulation of the variable for using computers elsewhere (BS4GCELS) with using computers at home (BS4GCHOM) and at school (BS4GCSch). The options at the end of that statement request a chi-square and cell chi-square.

ods html file = "C:\TIMSS\sasout\chisq.html" style = brown  ;
proc freq data = lib.student_int ;
tables Bs4GCels* BS4GCHOM ;
tables BS4GCELS* (BS4GCHOM BS4GCSch) / chisq cellchi2 ;
run ;
ods html close ;

I was working on a project for a new client and thinking about how we get work, which is a question I get asked all the time, both by consultants and by people interested in entering the consulting business.

It may be a self-fulfilling prophecy that we have received very little work from the latest marketing innovations of search engine optimization and social media strategies since we spend almost no time on those efforts, as a brief look at our website will tell you. So, where DO we get business?

1. Personal contacts - this is far and away the winner. These come from people we have written grants for in the past who now want us to do the evaluation piece – we have a LOT of clients we work with over and over for a decade or more. We get business from people referred by previous clients, former students, former professors, former employers who want us to come back for a specific project, former employees whose new employer needs a consultant. Of course, the longer we are in business, the more current clients, former clients and former employees there are, so personal contacts become more and more of an asset.

2. Conference presentations – this probably counts as personal contact also but I listed it separately because for all of the conference presentations I ever did – and I lost count but I bet it is around 100 – I had only received one contract. (There are other reasons for presenting at conferences.) That one contract was enough to cover all of my conference costs for life. Oddly, in the last year, we received two contracts from people who attended conference presentations, and met two others who want to hire us. I expect to get a contract from one of those two shortly. I have no explanation for this sudden change.

3. Central Contractor Registration - occasionally, someone sees us in CCR, needs a small business partner, contacts us and we work out something mutually beneficial.

4. Federal grants - this is second only to personal contacts in the business it brings in. Often, we’ll work with a client to write a grant and be included as a consultant to develop on-line training, or conduct the analysis. We do NOT write any grants on contingency with the agreement that if the grant gets funded we’ll get the work. (Short answer on why not: We have people willing to pay us on contract with 100% probability of being paid. Therefore, doing grant-writing that is unpaid for a less than 100% probability of getting paid for other work in the future is a bad business decision. Yes, other people do this either because they have excess capacity at the moment or because they are over-billing for the evaluation to compensate. Usually, both.)

5. Serving on committees – this hasn’t brought in business for me personally but a few small contracts have come from committees our other consultants have served on.

6. LinkedIn – I’m not very active on LinkedIn but I have received some work from people who I’ve met personally and then kept in touch with on LinkedIn.

I’m sure the answer to the question of how you find work depends on your industry and your services. We primarily do five things:

  1. Program evaluations, including quantitative and qualitative analysis
  2. Statistical analysis, either a complete project or assistance with parts such as research design or programming
  3. Grantwriting
  4. Development of blended (on-line plus on-site) training
  5. Delivery of blended learning offerings in ethics for Indian reservations and training in disability issues and services

We occasionally will do a contract for under $5,000 but the vast majority of our work comes in five- or six-figure contracts. With grants and program evaluations, our findings and our results may determine whether several people keep their jobs. Given that, what matters most is how comfortable clients feel that we can be trusted to deliver. It’s a very different environment than selling a purse or designing a business website or fixing someone’s home computer.

(Before the business website people get all snippy and tell me how important their work is to sales, let me point out that you can re-design a website tomorrow if it doesn’t quite work, but if your grant isn’t as perfect as can be by the deadline you’re screwed.)

I was very happy to get an acceptance letter from WUSS telling me that my class proposal for the 2011 conference was accepted. I’m usually happy to get acceptance letters in general but this was particularly nice since I have gotten the first four chapters of my naked mole rat book done and the next one is on logistic regression so the timing is perfect. The course description is below since I know you were just dying to read it.

Analysis of Categorical Data: For When Your Data Really Do Fit in Neat Little Boxes

What do birth, death, high school dropout, failing a course, losing a sale, engineering majors and being fired all have in common? If you answered, “Things my mother warned me about”, you need therapy. The correct answer is all of these are categorical variables, where data pretty much DO fit in neat little boxes. You were either admitted to college or you weren’t. The customer either bought insurance from you or opened the door and let his pit bull chase you down the street. The voter checked the box for the Democratic, Republican or Independent candidate. You get the idea. This course begins with a one-hour discussion of simple statistics that are part of the UNIVARIATE and FREQ procedures. The next two hours focus on a straightforward approach to PROC LOGISTIC. By the end of this course, you will be able to produce SAS analyses of categorical data and provide clear interpretations of the size of relationships, identify which variables matter in predicting outcomes, choose from competing models and whip out impressive graphic displays of your data and relationships. If you think McNemar and Akaike were on the same team in the last World Cup, this course is for you!

I read an interesting blog post by Jason Cohen of Smart Bear, “Do I dare call bullshit” and my comment was so long and my week is so filled, I thought I would just go ahead and repeat the comment here as my blog for today.

I certainly post things that piss people off from time to time. Since my blog is primarily on statistics, software and snarky comments, it is amazing to me that people occasionally feel upset enough to call me names, tell me I’ll be unemployed for life (this week, I’m so busy that sounds lovely but I’m sure after a while it would get depressing) and be insulting enough to upset my daughters on the rare times they might read it.

If I think your application sucks, I will say so. I am sure that there are contracts I have not gotten and job offers that I have not received because someone checked my blog and was appalled at the fact that I occasionally run statistics to check absurd allegations, say, about anchor babies (we Latinos prefer to call it the miracle of life). It’s not calling a spade a spade that loses you business. It’s calling bullshit bullshit.

I have deliberately not moderated my tone. If you are looking for someone who toes the party line, who never voices a controversial opinion and who will try to spin what the data show, I think it best that we not work together, because we’ll both end up unhappy.

I’m a good writer, a good programmer, a good statistician. I’ve presented at more conferences than I can count and obtained millions in grant funding. HOWEVER, I’m allergic to mornings, I hate cold weather and this afternoon I am sharing my office with a guinea pig named Ali, in addition to my office frogs, Type I and Type II.

Ali the Guinea Pig

I’m probably (okay, definitely) not going to work 9-5 for long periods of time, wear a suit every day or pretend that a piece of software that only works on the exact configuration used by the manufacturer is NOT a stupid idea.

I’m not so naive as to be unaware that blogs have morphed from personal journals on the web to marketing tools. However, I still write mine to remind me of things I want to remember – or just for the hell of it. Often when I’m working with a client, I’ll point him or her to a link on logistic regression or cluster analysis or installing SAS on Unix or whatever. Yes, my discussion of Wilcoxon and normality includes a naked mole rat, but there is a reason for that. Many of the people I work with are really, really smart in their own disciplines but are somewhat intimidated by statistics. It’s hard to be intimidated by an animal that looks very much like a penis with teeth.

All of that said, I DO censor myself where it would hurt someone, even their feelings, unnecessarily. If I read an article that makes a mistake that is common or is just terribly written, I’m not going to call out that person just because their article was the one I randomly read.  There are tons of articles out there that make mistakes or are terribly written. I don’t mock people’s names, personal appearance, race, gender or even their mistakes because that is just mean. When I was younger, I would do things like that and my friends and I all thought we were very clever. Then I became mature enough to learn the difference between trying to be clever and acting like an ass.

And I don’t bother writing back to email from morons, because if you truly are a moron, an email from me is not going to change you.

So, yeah, all in all, I think calling bullshit can be a good thing if the result is people who want to work with you really want to work with YOU and not some image you’ve cultivated.

I admit that I am hardly a typical user for statistical software, given that I had to go online to download a file with less than 7,000 records and less than 500 variables. STILL, I have to say that PSPP was a disappointment in a great many ways. This is NOT to say that it isn’t a nice little package. It’s just not a nice package for me. (I think I may have said this to some boyfriends in college, but I digress).

The first file I downloaded was a text file on a survey of public libraries. PSPP does not allow the option of moving a line using your mouse to designate where variables are split. No problem, I thought, I’ll just read it in and use a substring function to read the first seven characters of the field, which were the population. Unfortunately, it turned out that when the population was under 1,000,000 there was text data in that field.

Fine, I’ll just download it as an Access file, the other option. Except it turns out that PSPP doesn’t import any types of files except text. It would open a .sav file but I did not have one on this machine that I was interested in using.

The reason I didn’t have one I was interested in using is that PSPP only does really basic statistics and all the .sav files I had were ones I was using for studies that did repeated measures ANOVA, cluster analysis or factor analysis. None of these were options for PSPP.

PSPP does do descriptive statistics, regression and one-way ANOVA. It has a lot of functions listed; I presume they all work. The substring function worked fine. I didn’t try much else.

If I have to search around for a dataset for which a package can be used, it’s clear it’s not going to meet my needs.

Who IS PSPP good for?

I have worked with students who needed no more than a basic package for an introductory statistics class and they should be thrilled to have PSPP as a free alternative to buying yet one more thing in a semester. It is pretty easy to use for the limited set of statistics it does. I did finally download a file off of our server that was in Excel format, opened it in LibreOffice Calc, saved as a .csv file and had no problem opening it in PSPP. I did a multiple regression and it gave me everything I wanted: ANOVA table, R-square, Adjusted R-Square, beta-weights, t-values, etc.

The default formatting of the output is nothing to write home about, but you can select, copy and paste it into LibreOffice Writer, which is handy.

If you just need to do your own study for a relatively simple project and are going to be entering your own data, or maybe entered it in Excel and saved it as a text file, then PSPP should work fine.

I was looking forward to working more in Linux, since I think knowledge of operating systems is like everything else – use it or lose it. So, I was disappointed that PSPP is not a useful option for me. On the other hand, a lot of Linux packages get built up over time, and this is certainly a nice basic start and it may turn out to be a full-fledged package eventually. If anyone is interested in my opinion, I’d say the two biggest gaps at this point are the inability to import other formats besides .txt/.csv and the limited number of statistical tests available. Correlation would be the first thing I would add, followed by repeated measures ANOVA and logistic regression.

As it stands right now, PSPP is a nice little package, emphasis on the little.

 

Rocky & Bullwinkle

When last seen, our heroes were attempting to write a book with the title

Beyond SAS Basics: Tips, Statistics and a Naked Mole Rat

The first chapter was entitled

After the Data Step. The first half of it was posted here earlier, which you would know if you were following this blog in the probably vain hope that you might learn something.

Writing the second half of the chapter was delayed by people offering to pay me actual money if I would fly around the country to hither and yon and do work like a real grown up. I didn’t make it to hither, but you can see a picture of yon below.

Lac du Flambeau

Now that I have returned, I have completed the rest of Chapter 1. To wit …

The next section could have been an entire book in itself. I LOVE statistics. I have spent most of my life as a statistician, and also won a world judo championship and married three husbands (not at the same time). It is a myth that statisticians are boring and it is not true that math is hard, I don’t care what that stupid Talking Barbie doll said. Math is a lot easier than unemployment, in the opinion of most people. Since this book is titled “Beyond the Basics”, I did not include means, frequencies or correlations in the statistics section. I could have included simple linear regression or one-way Analysis of Variance – I know those are not that basic to most people.

If at this point your eyes are starting to glaze over and you’re starting to get anxious, just cut it out right now! You’re NOT that bad at math and it’s NOT that hard. It’s not rocket science. Besides which, having been married to a rocket scientist for fifteen years, I can tell you that they aren’t perfect, either. This section includes just two chapters. The first is on logistic regression – when your data really DO fit in neat little boxes – like did someone live or die, buy your widget or walk on by, vote Democrat or Republican. These are the kinds of things we want to predict on a daily basis. The second chapter in this section is the most common research design for testing whether something works – an experimental group and control group are each given a pre-test and a post-test. Read all about it in the chapter on Repeated Measures Analysis of Variance. If I have not convinced you, you can skip this chapter and still understand the rest of the book perfectly. Then, you can wait for my next book – Hamster Statistics with SAS. (Under suggestions for next year’s topic, one conference attendee wrote, “Statistics so simple a hamster can understand it – Bring your own hamster.”)

This next section is for those of you who don’t like statistics – and for those who do.
“Public agencies are very keen on amassing statistics – they collect them, add them, raise them to the nth power, take the cube root and prepare wonderful diagrams. But what you must never forget is that every one of those figures comes in the first instance from the village watchman, who just puts down what he damn pleases.” – Sir Josiah Stamp

People who don’t want to get too involved in statistics can take comfort in the fact that many statistical results are flawed because the data are of poor quality. If this describes you, there is plenty of work available out there fixing the data. I’ve read books that asserted that 80% of the time in any data analysis project is spent on data cleaning and data management. I deeply suspect that they just made this number up, but, as Dilbert said, studies have shown that real numbers are no more useful than numbers you just make up. (How many studies have shown this? 42. I just made that number up, too. See, it works!) My point, and you may rightly have despaired by now of me ever having one, is that a very large proportion of the amount of time on any project goes into fixing the data. So, if you don’t want to get very involved in statistics but you still want to use SAS for fun and profit, specialize in data quality improvement and you will be the life of the party. (Of course, that will only be at parties attended by nerdy SAS programmers but judging by the fact that you are reading this book it is assumed that you will fit right in.)
For those of you who DO love statistics (and, please, come sit next to me), the section on data quality is essential because unless you’ve been hanging out at parties where you met that guy in the last paragraph (and you didn’t invite me!), then you need to make sure your data are as near to error-free as you can get.

Section four is an introduction to SAS macros. There are a lot of reasons to like SAS macros. Any time you do the same type of task repetitively, you could write a macro and just supply the information that changes. For example, say you have a report you do for 24 different departments and the only difference is the name of the dataset you read in, the name of the department in the title and the name of the department manager – macro material for sure! Another reason to like macros is that a lot of the concepts you learn are applicable to other programming languages beyond SAS, and we’re all about being generalists here.
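Since the idea of supplying only the information that changes carries over to other languages, here is a rough sketch of it in Python rather than SAS. It is not a SAS macro and not the code from the book; the function name, department names, manager names and dataset names are all made up for illustration.

```python
def department_report(dataset, department, manager):
    """Build the header of one report; only these three values change per department."""
    return (
        f"Report for the {department} Department\n"
        f"Manager: {manager}\n"
        f"Source dataset: {dataset}"
    )


# Hypothetical departments, managers, and dataset names, for illustration only.
departments = [
    ("sales_2010", "Sales", "Manager A"),
    ("hr_2010", "Human Resources", "Manager B"),
]

# One call per department replaces a stack of copied and hand-edited programs.
reports = [department_report(dsn, dept, mgr) for dsn, dept, mgr in departments]

for report in reports:
    print(report)
    print()
```

With 24 departments the list simply gets 24 rows; the report logic is written once.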
The main reason not to like macros is that they look like they are written in Micmac. (Micmac, also spelled Mi’kmaq, is the language of a tribe native to Canada.) For a sample of the language, see Exhibit A.

Scroll in Micmac

Exhibit A

For a sample of a macro, see Exhibit B, from a 1997 paper by Art Carpenter*. (I was right, wasn’t I?)


EXHIBIT B

%do q = 1 %to &n;
PROC FSEDIT DATA=dedata.p&&dsn&q mod
SCREEN=GLSCN.descn.p&&dsn&q...SCREEN;
RUN;
%end;

There is also the problem that the way people learn the macro language is usually sufficient to send them screaming in the opposite direction. Macro processing is taught beginning with several chapters on parameter scope, tokens, quoting and masking text.  Instead, I’ve included a couple of macros so you can see right away how useful macros can be and learn the statements and functions as we go along.
So, now we come to the final section, which is the “where do you go from here?” Since I don’t know you well enough to differentiate between you and a hairless monkey, it’s a bit surprising that I have an answer for you, but I do. The secret to keeping excited about the work you do and keeping other people excited enough to pay you is NOT Viagra, regardless of the 1,247,877 emails you have received. In fact, the answer is to really and truly keep learning. This section includes recommended resources from websites to mailing lists to conferences to specific books and papers I found both useful and interesting.

It also includes a naked mole rat.

* Carpenter, A. L. (1997). Resolving and Using &&var&i Macro Variables.

How to make SPSS 18 do graphics whether it wants to or not …

SPSS isn’t the statistical package I use most often – a few times a month when clients request it. I have SPSS 18 on a Mac and it worked pretty much fine until recently, when someone felt the need for a graph produced with SPSS and I received this message:

Warning
Could not start Virtual Machine. Chart will not be drawn.

If you should have this same, sad experience, here is what to do:

1. See what release you are running. You can find this in SPSS by going to the menu at the top left corner that says PASW Statistics 18.0 and selecting the first option, ABOUT PASW STATISTICS 18.0. A window will pop up and say something like PASW Statistics Release 18.0.3. If it is a number less than 18.0.3, you need to update it.

2. You can either register with the SPSS Support site or, if you would rather not register or, like me, cannot be bothered to look up your password, you can skip this step and just use spguest as both your login and your password.

3. Go to this page

https://www-304.ibm.com/support/docview.wss?uid=swg21488075&wv=1

and use your login and password to get to the downloads page. Download and install the patch to upgrade to 18.0.3 for Mac OS X. This is free. Thank you, IBM/SPSS.

4. Go back to the same page and download the patch for 18.0.3.4. (No, I do not know what happened to 1, 2 & 3.)

The rest of these instructions are copied from the last part of the README file you will find in the folder when you unzip the patch you just downloaded. I pasted them verbatim because I thought they made an interesting point.

(B)
Download the ‘IBM SPSS Statistics 18.0.3.4 Hotfix Mac.zip’ file.
(This file will normally be saved to your ‘Downloads’ folder under your ‘home’ user account. The following instructions assume that this is where the file has been saved.)
Open the Macintosh hard drive icon on your Desktop.
Open the ‘Applications’ folder.
Open the ‘SPSSInc’ folder
Open the ‘PASWStatistics18’ folder.
Locate the ‘PASWStatistics18.0’ program launch icon.
Press and hold the ‘Control’ (Crtl) key on your keyboard.
Click the ‘PASWStatistics18.0’ program launch icon.
Select ‘Show Package Contents’ from the menu options in the pop-up window.
Release the ‘Control’ (Crtl) key on your keyboard.
Open the ‘Contents’ folder.
Open the ‘lib’ folder.
Locate the following file

libspssjvm.dylib

Press and hold the ‘Control’ (Crtl) key on your keyboard.
Click the ‘libspssjvm.dylib’ file icon.
Select ‘Get Info’ from the menu options in the pop-up window.
Release the ‘Control’ (Crtl) key on your keyboard.
Select the triangle toggle next to ‘Name & Extension’.
The contents of the ‘Name & Extension’ box should be as follows:

libspssjvm.dylib

Edit the contents of the ‘Name & Extension’ box to be as follows:

libspssjvm.dylib.orig

Select the ‘x’ in the upper-left corner of the ‘Get Info’ window to close this window.

A window with the following message will appear:

Are you sure you want to change the extension from “.dylib” to “.orig”?

Select the ‘Use .orig’ button.
Minimize the window displaying the ‘lib’ folder.
Open the ‘Downloads’ folder located in your ‘home’ user account.
Locate the ‘IBM SPSS Statistics 18.0.3.4 Hotfix Mac.zip’ file.
Double-click on the ‘IBM SPSS Statistics 18.0.3.4 Hotfix Mac.zip’ file to extract the contents.
A folder labeled ‘IBM SPSS Statistics 18.0.3.4 Hotfix Mac’ should now appear in the ‘Downloads’ folder.
Open the ‘IBM SPSS Statistics 18.0.3.4 Hotfix Mac’ folder.
Locate the following file:

libspssjvm.dylib

Press and hold the ‘Control’ (Crtl) key on your keyboard.
Click the ‘libspssjvm.dylib’ file icon.
Select the following menu option from the pop-up window:

Copy “libspssjvm.dylib”

Release the ‘Control’ (Crtl) key on your keyboard.
Minimize the window displaying the ‘IBM SPSS Statistics 18.0.3.4 Hotfix Mac’ folder.
Open the window displaying the ‘lib’ folder.
Select the ‘Edit’ menu option at the top of the window.
Select the ‘Paste Item’ menu option.
The file ‘libspssjvm.dylib’ should now appear in the window (immediately preceding the ‘libspssjvm.dylib.orig’ file).
Close all open windows.
Launch Statistics 18.0.3 and test the behavior that previously failed.
Report results.

———————–

I am happy to report my results as “It now works fine.”

I am also happy because both of the patches were free. I happen to think if you buy software and it quits working that the manufacturer should fix it for free, but they don’t always agree with me on that.

In the more than 30 years that I have worked with people from IBM, I’ve always been impressed with their customer service. That’s all I know about them. It may be that managers throw naked mole rats at the sales staff every day at 3 p.m. solely for their own amusement.

On the other hand, as I read all of the steps above, it does not strike me that this is particularly user-friendly. I tracked down the spguest workaround because I didn’t feel like looking up my password or resetting it. That was the only bit of a glitch I had, but I can see lots of my clients looking at those instructions and panicking, and well they might, because if you accidentally rename the wrong file out of about a bajillion similarly named files, this will not work.
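One way to take some of the risk out of the rename-and-copy part of the README is to script it, so that nothing is touched unless the expected files are really there. This is only a sketch in Python, not anything IBM ships. The function name is made up, and the folder paths in the comments at the bottom are my guess at what the README's folder names look like on disk (including the ".app" ending), so check them on your own machine first.

```python
import shutil
from pathlib import Path


def swap_library(lib_dir, hotfix_dir, name="libspssjvm.dylib"):
    """Rename lib_dir/name to name.orig, then copy hotfix_dir/name into lib_dir.

    Every check runs before anything is changed, so a wrong folder or a
    missing file stops the script instead of renaming the wrong thing.
    """
    lib_dir, hotfix_dir = Path(lib_dir), Path(hotfix_dir)
    current = lib_dir / name
    backup = lib_dir / (name + ".orig")
    replacement = hotfix_dir / name

    if not current.is_file():
        raise FileNotFoundError(f"No {name} in {lib_dir}")
    if not replacement.is_file():
        raise FileNotFoundError(f"No {name} in {hotfix_dir}")
    if backup.exists():
        raise FileExistsError(f"{backup} already exists; was the patch applied already?")

    current.rename(backup)               # keep the original as .orig
    shutil.copy2(replacement, current)   # put the hotfix copy in its place
    return backup


# Example call. Both paths are assumptions based on the folder names in the
# README; adjust them to match your installation before running.
# swap_library(
#     "/Applications/SPSSInc/PASWStatistics18/PASWStatistics18.0.app/Contents/lib",
#     Path.home() / "Downloads" / "IBM SPSS Statistics 18.0.3.4 Hotfix Mac",
# )
```

If anything goes wrong afterwards, deleting the new file and renaming the .orig file back restores what you started with.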

Here is the interesting point – the upgrade is free, and it works with no problem, BUT it requires you to invest a little bit of effort to install it and if you don’t have at least a basic degree of technical knowledge,  you might feel uncomfortable just blindly following directions. There is also the possibility that you could rename the wrong file or do something else that will screw things up.

In all of this, it seemed EXACTLY like Linux. Not that there’s anything wrong with that.

After reading that Curt Monash writes FIVE blogs on software and marketing, I felt as if  I had no excuse for not keeping up lately. However, for my excuse, you can click the link below:

My Cinchcast explanation of why I have not been blogging. Also, if anyone has suggested uses for a yoke, let me know.

I’m in Green Bay, Wisconsin at the CANAR conference to give a presentation on analysis of ethics data. CANAR is the Consortium of Administrators of Native American Rehabilitation, in case you did not know (which, be honest, you did NOT know).

Qualitative research would not be my first choice in most situations, but if there is a really good quantitative measure of ethics, I don’t know it, and I can guarantee that if such a thing exists it wasn’t validated for Native Americans on reservations. So… we have about 2,000 responses, mostly to different case studies, and I get to talk about some of those in 90 minutes tomorrow – twice, actually, because they asked to have the same presentation repeated so twice as many people can attend.

If you’re interested in what we found out, you could read a few posts I wrote on it earlier:

The Reservation Rush from Judgement – many people bend over backwards to avoid saying behaviors are unethical even in the most obviously unethical situations: “I need more information than that the board member approved falsifying records and can’t attend meetings because he is in jail for assault.” WTF?

What I learned about ethics from Sherlock Holmes – that sometimes the most interesting results are non-results. I was amazed by the number of people who simply did not answer the case study questions – although they did answer the multiple choice questions, and other items that required writing about themselves.  When they did answer the case study questions, their answers were almost invariably short ones, along the lines of “not sure”. It wasn’t a literacy issue. What I concluded from it was that it was very much like when I was teaching math, students who were unsure of their answers wrote as little as possible.

One more result, before I get downstairs to the conference – I was astounded at how many people responded that the behavior in case studies I thought were obviously unethical (for example, paying for your wife to come on a trip with you out of your travel budget) was just fine. My co-author, Dr. Erich Longie, who had written many of the case studies, was not surprised at all.

The groups I have worked with that most closely approximate what I have seen in terms of ethics in these analyses are executives at Fortune 100 companies and members of non-profit boards. What all three groups have in common is that there is very little oversight of the decisions they make. The other commonality is that this lack of oversight can lead to some very big problems.

And now, I’m off to the conference sessions ….

 

 
