Today’s post courtesy of Captain Obvious …

To do quintile matching, one must first match by quintiles. Hence, the name.

Quintiles divide your data set into five equal, um, fifths. Quint is Latin (or Greek or some other random language) for five. Hence, the name.

So, when I did this:


PROC UNIVARIATE DATA = AllPropen NOPRINT ;
VAR prob ;
OUTPUT OUT = testquint PCTLPTS = 20 40 60 80 PCTLPRE = pct ;
RUN ;


I expected to get the values that divided the data set into quintiles.

When I did this, because I was too lazy to type the numbers,

/* write the quintiles to macro variables */
data _null_ ;
set testquint ;
call symput('q1', pct20) ;
call symput('q2', pct40) ;
call symput('q3', pct60) ;
call symput('q4', pct80) ;
run ;

/* create the new variable in the main dataset */
data AllPropen;
set AllPropen ;
if prob =. then quintile = .;
else if prob le &q1 then quintile=1;
else if prob le &q2 then quintile=2;
else if prob le &q3 then quintile=3;
else if prob le &q4 then quintile=4;
else quintile=5;
run ;

proc freq data = allpropen ;
tables quintile ;
run ;

I expected to get five, even groups.

I did not.

I got three groups with 1,088 records, but my first group had 1,075, which is, obviously, less than 1,088, and my second group had more than 1,088.

I considered several possibilities. Did I misremember the meaning of percentile? Should it be LESS than the 20th percentile point instead of less than or equal to? If that is the case, why did the rest of the groups come out perfectly?

Did the macro facility for some reason not compare down to enough decimal places to see that the 20th percentile value was, in fact, equal? To check for that, I multiplied the probability by 10, and then by 100, and compared it to 10 times, and then 100 times, the 20th percentile value, thus requiring one or two fewer decimal places. Still, not equal.

I used %PUT to write the values of the macro variables &q1 to &q4 to the log.

They were correct.

I re-ran the program in SPSS. Same result.

Finally, it dawned on me. I did a PROC FREQ and realized that, duh, there was NOT exactly a 20th percentile. While there was, for example, exactly one score at the 40th percentile, there were 11 people at the 19.76th percentile and 15 at the 20.04th percentile. There was not a single score at the 20th percentile so my SAS program could not give me an exact 20th percentile.
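Here is a minimal sketch of the same thing on a toy scale, in Python rather than SAS purely so the arithmetic is easy to check. The data are made up (100 scores with 10 records tied right across the 20% mark), and `nearest_rank_percentile` is a simple nearest-rank rule I am assuming for illustration, not necessarily the percentile definition PROC UNIVARIATE uses. The assignment cascade mirrors the "le" logic in the DATA step above.

```python
from collections import Counter

def nearest_rank_percentile(sorted_values, pct):
    """Smallest observed value with at least pct% of the records at or below it."""
    rank = (pct * len(sorted_values) + 99) // 100  # integer ceiling of pct*n/100
    return sorted_values[rank - 1]

def assign_quintile(value, cuts):
    """Mirror the DATA step: the first cut point the value is <= wins, else group 5."""
    for group, cut in enumerate(cuts, start=1):
        if value <= cut:
            return group
    return 5

# Made-up data: 100 scores, 10 of them tied at 0.16, straddling the 20% mark.
scores = sorted(
    [i / 100 for i in range(1, 16)]      # 15 distinct scores: 0.01 .. 0.15
    + [0.16] * 10                        # 10 tied records
    + [i / 100 for i in range(26, 101)]  # 75 distinct scores: 0.26 .. 1.00
)

cuts = [nearest_rank_percentile(scores, p) for p in (20, 40, 60, 80)]
counts = Counter(assign_quintile(s, cuts) for s in scores)

print(cuts)                    # [0.16, 0.4, 0.6, 0.8]
print(sorted(counts.items()))  # [(1, 25), (2, 15), (3, 20), (4, 20), (5, 20)]
```

Three groups come out at exactly 20 and the two on either side of the tie do not, because the tied records cannot be split between groups.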

Thank you, Captain Obvious.

I have no idea why the obvious answer did not occur to me immediately, maybe because with a smaller data set, I wouldn’t expect to have several records match down to the 12th decimal place.

On the other hand, this further reinforces what I already knew about myself, which is that I am never satisfied with “close enough”. If it is supposed to be 20% and I get 19.76%, I want to know why, damn it!

I think it also kind of shows how easy it is to get tunnel vision. I have spent the last few days focusing on some really, really complicated design problems, so when I got back to looking at these results from the PROC UNIVARIATE I had done last week, I began by assuming it must be something complicated, instead of starting with the most basic, obvious possibility first, which is what I am always telling other people to do.

As the hockey player in Slap Shot said about the penalty box,

“Then you must feel shame.”



The Rocket Scientist is opposed to buying software that has licenses activated, on general principle and because we usually have at least six computers between the two of us – a desktop Mac each which are our two main computers, a laptop each for travel, a Windows machine or two for testing. Then there is the laptop and desktop (for visitors during sleepovers) in The Spoiled One’s room.

Any software that lets you license a personal copy on a set number of computers is grounds for a tirade. He suggested a number of alternatives to Dreamweaver for my latest project and I looked at a couple of them. In the end, I ignored his advice which I don’t usually do in these situations. It may be bloated and inelegant (people always say that about applications that have been around a while and keep adding features) compared to whatever the newest shiny thing is. I don’t know.  It certainly is expensive. What I do know is this:

1. I know how to use Dreamweaver. Although I haven’t been using it much the last couple of years, mostly doing sites in WordPress, I spent a lot of years using Dreamweaver, ever since I migrated from GoLive when Adobe bought it. WordPress won’t work for what I need and the hours to get equally familiar with anything new far exceed the cost of a license.

2. There are a lot of tutorials, forum posts, videos, books and blogs on Dreamweaver. Information is easy to find for what I need to do.

3. There are a lot of extensions that are plug and play. Yes, those cost money, too, but as I have discussed on this blog before, time has value. If I can buy something for $49 that would take me three hours to write, that is a good deal. If there is something out there free that does the same thing but it would take me thirty minutes to find it and another thirty to figure it out because the documentation is non-existent, $49 is still a good deal.

4. That whole bloated, inelegant thing frequently saves my ass. Let me give you a minor example from today. One of the words on our site is in Dakota. A person who is a native speaker of Dakota (yes, one of the few in the world and I am so happy to be working with her) sent me an email and said that we had spelled one of the Dakota words wrong. Unfortunately, we had been working on this for a while before she came on board and I was not too excited about going through every page and changing the spelling. I wasn’t even sure which pages contained this word. I wanted a grep command for the site. Well, one of the things that Dreamweaver does is let you do a find and replace on your entire local site. As a general rule, this can be a recipe for disaster, but in this case it found 55 pages where this word was used – misspelled – and changed all of the occurrences to the correct spelling.

So this is why I love Dreamweaver today.



Having failed recently to use BMI as a variable from a data set on school children in our example for propensity score matching, because people who fill out surveys are big, fat liars, we next went to a sample of really old people and used death as a dependent variable.

Try faking it on the dichotomous dead (yes/ no) variable. I dare you!

… which caused me to start thinking about other differences when using logistic regression.

By the end of two or three statistics classes, everyone knows that there are four types of sums of squares. Okay, well at least they know that there are at least two types of sums of squares, Type I and Type III.

The first one, coincidentally referred to as Type I, is also called the sequential sum of squares. Let’s assume you have two factors, Alcohol Use and Cigarette Smoking, and from whether the student uses alcohol or smokes cigarettes, you are trying to predict how much he smokes marijuana. You have an Analysis of Variance with a continuous dependent and two categorical predictors. If your model uses Type I sum of squares, and your statement is

MODEL  marijuana = alcohol cigarettes ;

You will get one estimate for the sum of squares for, say, alcohol. If your statement is

MODEL marijuana = cigarettes alcohol ;

You will get DIFFERENT estimates.

On the other hand, with the Type III sum of squares, you will get the same result regardless of order.
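To see the order dependence in numbers, here is a rough sketch in plain Python (not SAS) on eight made-up students, where alcohol and cigarette use overlap but are not balanced. `residual_ss` is a small helper I wrote for illustration; it fits ordinary least squares through the normal equations. The sequential (Type I) sum of squares for alcohol is the drop in residual sum of squares when alcohol enters the model, so it depends on what is already there. In a main-effects-only model like this one, the "entered last" value is what the Type III table would report for alcohol.

```python
def residual_ss(predictors, y):
    """Residual sum of squares from regressing y on an intercept plus predictors."""
    n = len(y)
    X = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations: (X'X | X'y).
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))

# Made-up, unbalanced data: 1 = uses, 0 = does not; marijuana is a usage score.
alcohol    = [0, 0, 0, 1, 1, 1, 1, 1]
cigarettes = [0, 0, 1, 0, 1, 1, 1, 1]
marijuana  = [0, 1, 2, 1, 3, 4, 3, 5]

sse_none       = residual_ss([], marijuana)
sse_alcohol    = residual_ss([alcohol], marijuana)
sse_cigarettes = residual_ss([cigarettes], marijuana)
sse_both       = residual_ss([alcohol, cigarettes], marijuana)

# MODEL marijuana = alcohol cigarettes ;  (alcohol entered first)
ss_alcohol_entered_first = sse_none - sse_alcohol
# MODEL marijuana = cigarettes alcohol ;  (alcohol entered last)
ss_alcohol_entered_last = sse_cigarettes - sse_both

print(round(ss_alcohol_entered_first, 3), round(ss_alcohol_entered_last, 3))
```

With these made-up numbers, alcohol gets a sum of squares of about 9.08 when it goes first and about 2.05 when it goes last, even though the full model is identical either way.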

SAS by default gives you the Type I and Type III sum of squares and expects you to know what you are doing. SPSS only gives you the Type III sum of squares and expects you not to ask any questions.

There is a nice explanation of Type I and Type III sum of squares on Matt’s blog, which he actually re-posted from somewhere else, but I found Matt’s blog generally interesting so I linked there.

What if, instead of doing an ANOVA you were doing a logistic regression, with not how much marijuana the person smoked as the dependent but whether he ever smoked it or not?

Does order in your MODEL statement matter in logistic regression? The short answer is – no.

The fact that the table with the Wald chi-square is labeled Type 3 Analysis would tip you off there, but what about odds ratios, parameter estimates, concordant pairs? Nope, nope and nope.

Try it for yourself and see.
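If you do not have SAS handy, here is a bare-bones sketch of the same experiment in plain Python. Everything in it is an assumption for illustration: the 16 students are made up, and `fit_logistic` is a simple Newton-Raphson maximum-likelihood fit I wrote for this example, not what PROC LOGISTIC does internally. It fits the model twice, once with alcohol listed first and once with cigarettes listed first, and compares the estimates.

```python
import math

def solve(A, b):
    """Solve the small linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[pivot] = M[pivot], M[k]
        for r in range(n):
            if r != k:
                f = M[r][k] / M[k][k]
                M[r] = [a - f * c for a, c in zip(M[r], M[k])]
    return [M[r][n] / M[r][r] for r in range(n)]

def fit_logistic(rows, y, iterations=25):
    """Logistic regression by Newton-Raphson; rows hold predictors, intercept is added."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iterations):
        prob = [1 / (1 + math.exp(-sum(b * x for b, x in zip(beta, xi)))) for xi in X]
        grad = [sum(xi[j] * (yi - pi) for xi, yi, pi in zip(X, y, prob))
                for j in range(p)]
        hess = [[sum(pi * (1 - pi) * xi[j] * xi[k] for xi, pi in zip(X, prob))
                 for k in range(p)] for j in range(p)]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

# Made-up cells: ((alcohol, cigarettes), students in cell, how many ever smoked).
cells = [((0, 0), 5, 1), ((0, 1), 3, 1), ((1, 0), 3, 1), ((1, 1), 5, 4)]
alcohol, cigarettes, ever_smoked = [], [], []
for (a, c), n, yes in cells:
    for i in range(n):
        alcohol.append(a)
        cigarettes.append(c)
        ever_smoked.append(1 if i < yes else 0)

b_ac = fit_logistic(list(zip(alcohol, cigarettes)), ever_smoked)  # alcohol first
b_ca = fit_logistic(list(zip(cigarettes, alcohol)), ever_smoked)  # cigarettes first

print("alcohol odds ratio:", math.exp(b_ac[1]), math.exp(b_ca[2]))
print("cigarettes odds ratio:", math.exp(b_ac[2]), math.exp(b_ca[1]))
```

Each predictor's estimate, and so its odds ratio, should agree between the two fits up to floating-point noise. Only the position in the output changes.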

(Of course, you COULD do stepwise logistic regression – but I wouldn’t.)




I’m taking off for two weeks on a working vacation defined as I will be working while the rest of the family is on vacation. I found it interesting that the list of what I NEEDED to do in getting ready was

  1. Go to the Apple store and pick up a 500 GB hard drive. The second computer I ever owned was the original Macintosh – 128K RAM and all the storage you would ever need (!) on a 400K floppy. So, I am carting with me the equivalent of 1.3 MILLION of those disks, because, of course, I need that storage space.
  2. Install and license copies of WebStorm and Dreamweaver on my laptop.
  3. Copy thousands of pages of books on to my iPad, including Eloquent JavaScript, the SAS documentation on logistic regression, Beautiful Data, JavaScript: The Definitive Guide, SAS Documentation on Proc Mixed and six SPSS manuals.
  4. Download an app my friend recommended for calling and texting internationally – Viber. I’m not so sure about it though, because some people told me they got charged for calls and texts even when they used it. If you know anything, please chime in.
  5. Charge my iPad, iPhone and laptop.
  6. Sync all my files using Dropbox.
  7. Save the web pages I want to go back to on Delicious.
  8. Share files with a collaborator on Google Docs.
  9. Share more files with other collaborators on our company intranet.

The first thing I packed was a book, Practical Text Mining. I started reading this book right after I picked it up at SAS Global Forum and after 200 pages got sidetracked with work. It is the only book on my reading list that does not come as an ebook, which is unfortunate since it is over 1,000 pages long.

Since every story I ever tell about anything includes the line,

“I was out of town …”

it can be concluded that I travel A LOT. Packing has changed greatly over the past 30 years but here is one thing that hasn’t – Never put anything in a checked bag that you absolutely have to have when you get there. 

So, the second thing that went into my carry-on was underwear, along with an extra pair of contacts and an extra pair of glasses. I don’t know what it says about me that I packed a book on text mining first. I was actually thinking about the weight limit of the one bag The Rocket Scientist is checking when he flies out the next day with The Spoiled One. She insists on packing enough beauty products for a very ugly small country, so I figured I would take the heaviest thing I had in my carry-on bag.

It goes without saying that my computer, external drive, iPad and iPhone went into my briefcase, along with one cable for charging. The other chargers could be packed.

With all of that done, I was almost packed. All I needed to do was throw some clothes in a suitcase – which at one point in distant memory was actually the definition of packing!



We are looking for data to use as an example of propensity score matching for a couple of upcoming workshop / classes. Since the data I have used previously belonged to other people, I needed to come up with an example that could be stated in a format something like:

Controlling for X, Y and Z, are people who received treatment A more likely to kerfluffle than people who did not receive treatment A? Here, kerfluffle is the outcome of interest, that is, your dependent variable.

Our new intern, Chelsea (yay, Chelsea!) had the brilliant idea of looking at whether drinking whole milk was bad for you (remember high fat milk being evil before Coke became evil?) Drinking whole milk could be treatment A and being overweight could be the kerfluffle, I mean, outcome, of interest. We could control for other things like exercise, family income – except that when we started looking at this question with a large public data set of surveys of students we found diddly squat in the way of group differences.

Then, I tried it from a different perspective, having listened to The Daily Show enough times to know that soft drinks are the cause of obesity. I was also interested because I thought I was the only person in America who couldn’t stand Coke or Pepsi – tastes like melted sugar. So, I also ran some analyses with two groups: students who drank milk daily and never drank soft drinks, and students who drank soft drinks (non-diet) daily and never drank milk. Still no difference, not in weight, not in computed BMI based on age, weight and height.

Why, you may ask, didn’t we do something like compare people who drank low-fat milk with those who only drank sodas? Here is the thing: if you run enough analyses, the odds are great that you will eventually find significance just by chance. When there is no real difference, a result that is significant at .05 still happens one time out of twenty.
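To put a number on "the odds are great", here is a small sketch in Python. It assumes the tests are independent and that there is truly no effect in any of them, which is a simplification (analyses run on the same data set are usually correlated). Under those assumptions, the chance of at least one "significant" result in k tests is 1 - (1 - .05)^k.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive in n_tests independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 5, 20):
    print(k, round(familywise_error_rate(k), 3))
# 1 0.05
# 5 0.226
# 20 0.642
```

So by the time you have tried twenty different comparisons, you are more likely than not to have found something "significant" that isn't.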

We started with reasonable hypotheses and they did not pan out. MY hypothesis is that people lie about their weight. This is further supported by the fact that the distribution of reported weight did NOT resemble the population distribution at all. It was only slightly negatively skewed rather than showing a large percentage of obese people.

My conclusion was that people are big fat liars. This is also borne out by many years of coaching judo, a sport where people compete in weight divisions so at the end, they DO have to get on a scale and I know what they weigh.

It may be that whole fat milk or Coke or Pepsi causes obesity but we just don’t know because our outcome variable is measured unreliably.

I think our plan B is to use a different data set with old people and use death as the outcome variable. After a year or two, it’s usually pretty easy to determine whether people are dead or not. They’re the buried ones.



Oddly, Sandra Scarr, author of both the seminal study on IQs of black children adopted into white homes and the book on child care, “Mother care/other care” gave this advice to working mothers. I quoted her years ago in a chapter I wrote for a book used in Freshman 101 courses. The title (of the chapter) was “Handling the triple threat: How to hold a job, raise a family and still be sane by graduation”

Or something like that. I wrote it a long time ago.

Anyway… it is odd that I found myself in this same over-scheduled situation lately and came to the same conclusion. We have recently added an administrative assistant, two graphic designers, a second developer and just this week a new intern (welcome, Chelsea!)

There is a tendency in small businesses, perhaps even more so with consulting, to try to do everything yourself to save money. I would say that is exactly wrong. Personally, I try to never do anything myself if I can pay someone less than my hourly rate to do it. Here are a few examples:

In addition, those last two are things I just suck at. I can hear you saying as a small business owner,

“But I don’t have that much typing, data entry, filing, etc. to keep someone busy. Maybe it is an hour each day.”

Fine. Pile it all up and ask someone to come in one day a week for five hours. In that extra time, you’ll probably generate enough work so it is six or eight hours before too long.

Another problem is training people. I suck at that, also. This is one reason, by the way, that employers, especially small businesses, don’t want to hire someone who is over-qualified. If you are so good that you should have my job – well, the company only has one president and I am not leaving, so eventually, you’ll probably leave for a better job and I’ll have to hire and train someone else. If you are lucky, you’ll have someone else in your office who can train people, too.  (Honest, Jenn, I was not deliberately out of the office on Marisol’s first day – I SWEAR!)

What if you have so many varied tasks that come up occasionally? You can do what Hollywood types sometimes do with their personal assistants – have them listen in on your calls. I don’t really recommend that for consulting, especially if you do anything confidential. What I am doing, though, with a person who is training for a management position is forward to her many of the emails and presentations that I do, so she can learn by example. I have to do these things anyway and the next time around, or the time after that, I hope she will do them for me.

Similarly, as you train someone new, have them proof-read your memos, contracts, presentations, articles – whatever it is you’re writing.

If you are bringing in a new person on a programming project, have them start at one end or the other, documenting the code, doing the first crack at the data cleaning. Remember Vygotsky and his whole scaffolding thing? (If not, click here to read something I wrote over 15 years ago when I was just as  much of a smart ass as I am now.)

The other way of buying all the help you can afford (programmers listen up here) is buying off-the-shelf solutions, everything from 3-D models to pop-down menus for website templates. There are people who have an almost religious aversion to paying for software. I am not one of those people.

I pay for a professional license for surveymonkey because it includes skip logic, customized re-direct pages for survey end and disqualification and download as an SPSS file. The amount of coding this saves me far exceeds the cost of the annual license.

I use Dreamweaver in part because I have used it for years and learning a new package would take time, which, when you bill by the hour, is money. If you are a consultant, repeat after me,

“My time has value.”

I feel bad for anyone trying to break into a new market because to take away customers, they are going to have to offer people like me enough to make learning a new package worthwhile.

On the flip side, whenever I am doing anything new, I look around for products I can use, whether free or for sale, that will cut out some of the work.

This brings me back to my rant about Codecademy that I have not gotten around to writing. One of my objections to it is that you write code from scratch. While I have several times recently started with a blank file in TextWrangler just for fun and to learn something new, that is almost never the way I do a project of any size for professional reasons.

I will look either on my computers or on the web for a program that does something similar to what I want and then modify it.

When I start a new project, whether it is a programming project or web design, my first thought is,

“What products exist – IDE, statistical software, content management system, etc. etc. — that will cut down the hours required to get this done?”

Recently, The Rocket Scientist, who has given up working for the Dark Side to work for The Julia Group (we have Chardonnay) asked about buying a piece of software. When he started to give me the “business case” for it, I stopped him and said,

“Please, just buy it. Your time has value.”

What if you don’t have any money, though, because you have no work? Well, in that case, I guess your time doesn’t have any value. That sounds like a personal problem to me.



Years ago,  I happened to be at a meeting where two well-known African-American researchers were speaking. Like many conferences, there were “round table” luncheons with the famous people that you could attend. I forgot what the topic was supposed to be, but there were a couple of young African-American women at the table and somehow the topic got to racism. Famous Professor A, who must be about 80 years old by now, said,

“I really think it was easier back in our day. Racism then was blatant. You knew what you were up against. People would just say to you right out that they wouldn’t hire you because you were black.”

We all looked puzzled, but Famous Professor B nodded understandingly and agreed,

“Then you’d go to all of your friends and say, ‘They said they wouldn’t hire me because I’m black’ and all your friends would have your back and say, ‘Those racist dogs! You’ll succeed despite them! You’re better than them!’”

Professor A went on,

“Now, they don’t tell you that, they say, ‘You’re unqualified. You don’t meet our requirements.’ No one ever says they don’t want to hire you any more.”

Professor B nodded again,

“No one goes back to their friends and says, ‘They wouldn’t hire me because I’m unqualified.’ They’re ashamed to say it, or if they do, there’s just this awkward silence and your friends say, ‘Well, I’m sure you’ll find something.’ I think it is harder for people now because with subtle racism you don’t question the system, you question yourselves.”

Professor A said,

“You know what, though? These young people are JUST as qualified as I was at their age. No one told me I wasn’t qualified. They just told me I was black.”

Professor B nodded for the third time.

I’ve never forgotten that conversation.

Personally, I had some incredible luck in mentors in my Ph.D. program. Prior to that, though, as a young engineer, as an MBA student, I seldom had the same kind of experiences. Which, now that I think about it, seems kind of funny. At 23 years old, I was an MBA, working as an engineer, with a couple of years of professional experience. In retrospect, that kind of seems like the type who would be “qualified” and attract some mentoring. Was it racism? Sexism? Or was it just that I was pretty much a pain-in-the-ass know-it-all? (I was, but I think that is just to say I was 23.)

I noticed over the years that the junior people I tended to co-author papers with were predominantly Hispanic.  So, I cut it out. I didn’t cut out co-authoring papers with students or encouraging students to present. What I did was cut out encouraging individual students. Now, I make a blanket announcement to my class that anyone who is interested in presenting their research should contact me. (I can make this offer because I seldom teach more than one class a year. I have a company to run.)

Because both the Western Users of SAS Software (WUSS) and SAS Global Forum offer student scholarships, I forward links to those to the class mailing list.

I was reminded of that conversation recently because two of the students took me up on that offer this year. Their paper was excellent and it was accepted for the conference. Coincidentally, they are two young, African-American women.

And they are DEFINITELY qualified!




When I send bills to clients, I often, both for their information and mine, include the initials for the staff members being charged to that project. Of course, people are charged at different rates and this helps us internally keep track of who is being charged to what project and also allows us to easily respond to any questions a client may have about billing. Usually they know who is working on their project because they are working face to face, but if it is a research assistant checking references, for example, they may not.

Sending out an invoice tonight I noticed the names

AnnMaria (me), Castillo, Ortiz

In my in-box, two people who are working on another project – Flores, Ochoa

Also in my in-box,  email from another Flores (no relation to the first).

Our medical director is a Flores (he is related to the first).

What exactly is going on here?

Well, the Ortiz has a journalism degree from NYU, teaches at Tufts University and has a long list of publications. She is also my daughter. When I need top-notch editing, I turn to her.

Also, my second daughter, Jenn, is doing a few weeks of work for us this summer. Jenn has a history degree, a teaching credential and a master’s from USC. She has taught middle school for three years. When I needed someone to fly to Washington, DC to do a bit of research on social studies education for middle school students, I immediately thought of her. She went to a Catholic elementary school that was 85% Latino.

The Castillo is her best friend from those days and when I was looking for a research assistant, she was available.

The Ochoa is another of Jenn’s friends, someone she met in college in San Francisco, and when I was looking for an animator, he was available. I had seen his work previously because Jenn had shown it to me, so when I needed an animator, I thought of him. He has an art degree and loads of experience with Flash and other 2-D animation.

The Flores is my old friend’s son, from way, way, way back. I knew the family before he was born, saw him grow up, saw his mega-talented art work and this is now the fourth project for which I have hired him. He has an art degree and loads of experience with graphic design.

The other Flores is a former co-worker I used to drink beer with a few years ago. He has a degree from USC, a ton of experience and is supremely reliable.

So … if I have no trouble finding Latino staff members, what does that say about affirmative action? Well, as I said on this blog previously, I’m not 100% sure I’ve ever met another Hispanic statistician. I don’t know a lot of business owners who are Hispanic who are in technical fields – and I get out a lot.

None of those jobs were advertised. The connection that all of those individuals had is that they knew me or they knew someone who knew me. Don’t get me wrong, they are all also very talented, intelligent and reliable. If you know me and I know you are an idiot, it doesn’t help you.

My point is that I am pretty sure that most small business owners are exactly like me. When you have a small company every hire is important. We can’t afford deadweight and if you are really outstanding you’ll make a great difference here, not like working at General Dynamics and making their stock go up .0001 point after 30 years.

So …. I would bet that there are fewer opportunities for Latino young people starting out (and Native Americans and African-Americans) because they know fewer people who are in a position to give them that start. It’s not that white business owners start out saying,

“I’m only going to hire white people.”

I certainly did not start out saying,

“We are going to have Latino preference.”

We don’t. We also hired two people this year who are Native American and I am sending a contract out to a third this week. Again, they were people I knew and people who knew people I knew.

My point is that many white business owners and employers say they, personally, do not discriminate. I believe them. However, they point to the fact that neither they, nor anyone they know, deliberately discriminates as proof there is a level playing field. I’m not so sure.

Yes, I realize that my personal experience is not a random sample. Yes, I do know that the plural of anecdote is not data. It makes me wonder, though, when every name on the invoice is a Hispanic one, how often the flip side happens in other companies. You honestly, sincerely look for people who are competent, have the right degrees, the right experience, good references from people you trust – and somehow all of the people you hire look like you.

Me and my sisters



I learned at St. Mary’s Catholic School that envy was a sin. Good thing I am going to mass on Thursday, because I am suffering from a good deal of envy lately.

First there is blog envy.

If you have not checked out The Berkeley Blog, I highly recommend it, starting with this post on the Geography of Inequality.  Although the fact that it is written by “175 professors and scholars” and mine is written by one person who can occasionally muster up a facsimile of scholarship on a good day assuages my green pallor somewhat – still – their blog is better than mine and I am envious.

Then, there is article envy. I wanted to write an article on power analysis for mixed models. This is something more people should write about because I have looked at what some of my clients have had recommended for power analyses and they were – what’s another word for “stupid”? Not my clients. They were the opposite of stupid because they immediately recognized that even if they weren’t sure how to do a power analysis for a mixed model that it was probably not the exact same whether you were doing an independent t-test or a hierarchical linear model with correlated data and two levels of nesting. So, they came to me for a second opinion.

There should be a lot more written about power analysis for multilevel models. Bethany Bell and some other people wrote a really, really good paper with the title “Dancing the Sample Size Limbo with Mixed Models: How Low Can You Go?” I am even envious of the title.

James Coraggio and a whole bucket of co-authors from the University of South Florida wrote a macro I wish I had written, GLMM_SIM: A SAS® Macro for Evaluating the Statistical Integrity of General Linear Mixed Models. I tried running their macro and did not really get the results I expected, but I am pretty certain that is not due to any defect in their macro nearly as much as to the fact that I spent about two minutes and thirty-two seconds on it before I got called away to try to meet some deadline. I’m sure I’ll get back to it in the next week or so. I have macro envy.

The one thing all of those have in common, I noticed, is they are all written by more than one person.

Me and Julia reflected in mirrors

Great! Now, I have clone envy.



I thought it would be really interesting today to get back to looking at SAS Enterprise Miner, especially the text mining. I thought it would be cool to take another look at Statistica. I also thought it would be cool to finish a blog post I started on power analysis for mixed models. And I thought it would be fun to write something on propensity scores and logistic regression. I really wanted to get back to the game I was writing in JavaScript, which is working but nowhere near as cool as I want.

I have been struggling NOT to take a new contract that I know would be really cool and interesting  because I know I am booked 110% already. I have had the offer in my in-box for days and I would LOVE to do it but …

This is all very odd because I went into business with the sole intent of doing cool and interesting stuff. A funny thing happened, though.

The business started to grow. So I hired a part-time person to do data entry, travel reports, typing, filing and other stuff. It grew some more. So I hired a developer. And two consultants to do research. Two graphic designers to do graphic design stuff. A consultant to do translation. A project manager.

So, instead of any of that, here is what I did this weekend:

Um, did I mention this is the weekend?

I got into this same position many years ago. I was teaching college full time and running a consulting company on the side, with ten people working for me. Eventually, it got to the point that my assistants were passing me up in technical skills because I was spending so much of my time in meetings while they did all of the fun stuff.

It sounds like I’m complaining but I’m not really. I also went to The Grove and had a lovely lunch at a French restaurant, bought The Spoiled One clothes for the new school year at Abercrombie & Fitch and Nordstrom’s, assisted by what appear to be Stepford Salespeople.

My second daughter, The Perfect Jennifer, who is a middle school history teacher, was with us and commented,

“I’m not comfortable at The Grove. It is full of people with money, shopping.”

I pointed out to her that we, in fact, were shopping at The Grove. So, there are pay-offs to not always doing exactly what you want.

Still, tomorrow, after I send off a few memos and make a few phone calls, and before my last meeting, I’m going to make sure I get some programming done. Because, if I don’t get to do SOME of the fun and interesting stuff, what’s the point, really?

