Yesterday, I talked about using a macro for beginning checks on data quality and I promised to explain it today. So, here we go…

options mstored sasmstore = maclib  ;

If you want to store your macro, you need to use two options in the OPTIONS statement.

MSTORED tells SAS that you want to use stored compiled macros, and SASMSTORE= names the libref of the library where they are kept.

Why you need two options, I don’t know.

LIBNAME MACLIB "C:\Users\MyDir\Documents\My SAS Files" ;

*** That’s the directory where my macro will be stored.
%macro dataqual(dsn,idvar,startvar,endvar,obsnum) / store ;

*** This creates a macro named dataqual  that will take parameters for

dsn = data set name

idvar = the subject identifier in the data set . This is something like social security number, telephone number  (if you’re a phone company), customer number or other variable that should NOT have duplicates.

startvar = this is the first variable in the data set that I want to get descriptive statistics on

endvar = this is the last variable I want descriptive statistics on

obsnum = the number of observations in the data set (it gets used below to compute the percent missing for each variable)

The / store is an option that tells SAS to store this macro; it will be stored in the library specified by the SASMSTORE= option and matched in the LIBNAME statement.

Title "Duplicate ID Numbers" ;
Proc freq data =  lib.&dsn noprint ;
tables &idvar / out = &dsn._freq (where = ( count > 1 )) ;
format &idvar ;

*****    This will create a frequency distribution but not print it (that's the NOPRINT option on the PROC FREQ statement). This is important to remember because many of my data sets will have hundreds of thousands or millions of records, each with a unique identifier. It will output the duplicate values to a data set named &dsn._freq, with, of course, &dsn replaced by whatever name I give it.

Since many of the public data sets I use have formats for every value, and I don’t want the formatted value used, the statement

FORMAT &idvar ;

will cause it to use the unformatted value for &idvar .

proc print data = &dsn._freq (obs = 10 ) ;
run ;

*** This prints the first ten duplicate values. You want to be careful to put in that (obs = 10 ) just in case something went wrong in the FREQ procedure and it ended up outputting every record. For example, if you accidentally put (where = (count = 1 )), you might get 2 million records in your &dsn._freq data set, and it would not be very good to print all of those out.

(Skipping story of dumping reams of green and white lined computer paper. If you’re my age, you have one of those stories of your own.)

proc summary data = lib.&dsn mean min n std ;
output out = &dsn._stats ;
var &startvar -- &endvar ;

**** This is going to create a data set, &dsn._stats, with the mean, minimum, n and standard deviation for each variable in your data set from &startvar to &endvar.

proc transpose data = &dsn._stats out = &dsn._stats_trans ;
id _STAT_ ;

This is going to transpose your data set so that instead of five records with 200 or 500 or however many variables you have, you have 200 or 500 records, one per variable, with variables for _NAME_, _LABEL_, minimum, mean, n and standard deviation.
data &dsn._chk ;
set &dsn._stats_trans ;
pctmiss = 1 - (n/&obsnum) ;
if min < 0 then neg_min = 1 ;
else neg_min = 0 ;
if std = 0 then constant = 1 ;
else constant = 0 ;
if (pctmiss > .05 or neg_min = 1 or constant = 1) then output ;
Title "Deviant variables to check" ;
proc print data = &dsn._chk ;

**** This reads in the transposed data set and then does some quality checks – if the standard deviation is 0, more than 5% of the data are missing or there is a negative minimum value, the variable is output. Then, I get a listing of variables that are in some way warranting a second look. The columns will show the descriptive statistics, plus new variables that show the percent missing, whether the variable is a constant or has a negative minimum.

Title "First 10 observations with ALL of the variables unformatted" ;
proc print data = lib.&dsn (obs= 10)  ;
format _all_ ;

run ;

****  You should always stare at your data. This prints out the first 10 values without formats, just so I can see what it looks like. I use a lot of public data sets that come with user-defined formats. The FORMAT _ALL_  statement removes all formats for this step and will print just the unformatted values.
Title "First 10 observations with ALL of the variables formatted" ;
proc print data = lib.&dsn (obs= 10)  ;
run ;

*** This prints the first 10 observations with formatted values, obviously. Using formatted values is the default so I didn’t need a FORMAT statement.

Title "Deviant variables to drop" ;
proc print data = &dsn._chk  noobs;
var _name_ ;

****  This prints the names of all of those variables that have problems. I can copy this from the output window, type the word

DROP

paste the list of variables, add a semi-colon and I have my DROP statement.
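To make that concrete, the end result of that copy-and-paste might look something like this (the data set and variable names here are invented for illustration):

```
data lib.student_clean ;
    set lib.student_int ;
    /* the word DROP, the pasted list from the output window, then a semi-colon */
    drop BSBGBOOK BSBGCOMP BSBGCALC ;
run ;
```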

%put You defined the following macro variables ;
%put _user_ ;

**** This just helps with trouble-shooting. It will put the user defined macro variables to the log so it looks something like:

dsn :  studentgr8

idvar: IDSTUD

and so on.

run ;
%mend dataqual ;

*** and that is the end of my macro.

I hope you love it as much as I do. While it isn't all fun sexy cool like doing a proportional hazards regression model or even a factor analysis, if your data are no good, all of those fancy statistics are, at best, wrong and a waste of time and, at worst, wrong and a major blow to your career, depending on whose life your wrong data screwed up.



I will be missing my lovely daughter Ronda’s fight in Calgary next month because I will be attending the National Association of Hispanic Journalists conference in Orlando, Florida, where sports journalist and equally lovely daughter Maria Burns Ortiz will be. It is completely false that this proves Maria is the favorite child. In addition to which, Julia claims to be the favorite child because “I’m the baby, you gotta love me & you named your company after me when I was a baby, so there!” The perfect Jennifer just smiles because she KNOWS she is the favorite child, after all, she points out, she is the only one with a graduate degree.

They are all wrong. I do not have a favorite child.

I do, however, have favorite macros.

Yesterday, I wrote about how I start my programs to test data quality. In fact, there’s a macro that I call at the beginning of each program that does all of that stuff:

%strt(TIMSS,G8_STUDENT07,studentformats) ;

I like it okay, but I don’t love it. The code is below.

%macro strt(Projdir,dsn,fmts) / store ;
options nofmterr ;
libname LIB "C:\Users\AnnMaria\Documents\&projdir\sasdata";
proc means data = lib.&dsn ;
%include "C:\Users\AnnMaria\Documents\&projdir\sascode\&fmts" ;
%put You defined the following macro variables ;
%put _user_ ;
run ;
%mend strt ;

This macro that goes at the end is one of my favorites. I enter one line in my program like this:

%dataqual(student_int,IDSTUD,BS4GOLAN,USBSGQ6B,7377) ;

This one line results in some very useful stuff: a list of duplicate ID numbers, descriptive statistics for every variable, a report flagging variables with too much missing data, zero variance or a negative minimum, the first ten observations printed with and without formats, and a list of variable names ready to paste into a DROP statement.

Why that last one? Seriously, you don’t think I’m going to type 300+ variable names do you? I’m a terrible typist. After I look through the results above, I will usually decide to delete most of the variables. There are usually a few I want to keep – for example, father’s education might be missing for more than 5% of the sample but that’s because some people don’t live with their fathers and may not have the data, so I might keep that variable despite missing data.  Besides, 5% is a low threshold, so I may keep variables that are missing 10% of the data. For some variables like income, a negative number can be correct. After examining the results, I can copy and paste the whole list of variables into a DROP statement, and just delete the few I want to keep.

The code for the macro is below. Tomorrow, I’ll go through what each statement does. I would do it today but this post is already long and besides, it’s powered by Chardonnay and we are all out of wine. The resident rocket scientist, who usually brings home the wine as well as the bacon, is on a diet so he’s been bringing home diet soda instead. Good for him, but what about ME, what about MY needs, what about people who need an explanation of

FORMAT _all_ ;

Apparently he did not consider that. Ha!

/* =========^===============================================**=============== [This line fulfills no purpose */

options mstored sasmstore = maclib  ;
LIBNAME MACLIB "C:\Users\MyDir\Documents\My SAS Files" ;
%macro dataqual(dsn,idvar,startvar,endvar,obsnum) / store ;
Title "Duplicate ID Numbers" ;
Proc freq data =  lib.&dsn noprint ;
tables &idvar / out = &dsn._freq (where = ( count > 1 )) ;
format &idvar  ;
proc print data = &dsn._freq (obs = 10 ) ;
run ;
proc summary data = lib.&dsn mean min n std ;
output out = &dsn._stats ;
var &startvar -- &endvar ;

proc transpose data = &dsn._stats out = &dsn._stats_trans ;
id _STAT_ ;
data &dsn._chk ;
set &dsn._stats_trans ;
pctmiss = 1 - (n/&obsnum) ;
if min < 0 then neg_min = 1 ;
else neg_min = 0 ;
if std = 0 then constant = 1 ;
else constant = 0 ;
if (pctmiss > .05 or neg_min = 1 or constant = 1) then output ;
Title "Deviant variables to check" ;
proc print data = &dsn._chk ;

Title "First 10 observations with ALL of the variables unformatted" ;
proc print data = lib.&dsn (obs= 10)  ;
format _all_ ;

run ;
Title "First 10 observations with ALL of the variables formatted" ;
proc print data = lib.&dsn (obs= 10)  ;
run ;

Title "Deviant variables to drop" ;
proc print data = &dsn._chk  noobs;
var _name_ ;
%put You defined the following macro variables ;
%put _user_ ;
run ;
%mend dataqual ;



I used to think that a clean house was a sign of a wasted life. After all, why would anyone spend time cleaning when they could be reading, writing, teaching, running, programming, doing judo or drinking margaritas at the pier watching the sun go down?

Well, I still don’t alphabetize my spices or clean the grout with a toothbrush. In fact, the grout and I have pretty much adopted a truce. It doesn’t think about me and I don’t think about it.

On the other hand, I’ve found that having  the place cleaned up actually saves me time because I waste a whole lot less time looking for things.

The same is true of your programming. So, here, as a public service on behalf of mothers everywhere are several tips to clean up your programming.

1. Get rid of all the useless junk.

When I’m first starting a project, I may run the same program over and over. It runs, and there is a 200-line INPUT statement because the data set has several hundred variables and they just could NOT be named something like Q1 – Q800, now could they? At line 312, it turns out that I misspelled the libref. I did something brilliant like spelling “in” with two i’s. So, now I need to re-run it, and when I am reviewing my log, I really don’t need those first 400 lines of statements and SAS notes from the run that failed. I want to start all over.


I ran several analyses to better understand different variables. I do this all of the time. For example, with the TIMSS data on students, there are two variables on student sex. Neither is as interesting as one might hope. One is the sex recorded on the student’s permanent record (apparently it does exist, and, at least they got your gender right). The other is the sex the student put on the survey. I did a cross-tabulation and found that what the student put always agreed with what was in the administrative record (not even one smart-ass in a group of over 7,300 eighth-graders – amazing) but 58 students did not answer this question. So, I’m going to keep the ITSEX variable and drop the other variable. All well and good, but I did several cross-tabs like that, with birth month, birth year and so on. I don’t need to keep those results cluttering up my output window.

I put this statement FIRST in the program when I am starting on data cleaning:

dm "log; clear; output; clear" ;
If I re-run the program from the beginning, it erases the log and output from the previous runs, leaving only the results from the current run.

The second line I put in a data cleaning program is:

OPTIONS NOFMTERR ;
At a later time I’ll probably be interested in any user-written formats but this isn’t the time. I don’t want my program stopping and giving me a bunch of errors because I don’t have some special format.

Often, the programs I use do come with special formats and a PROC FORMAT with hundreds of lines of VALUE statements. I run that once, see that there are no errors and then move that out of the way. So, the third line I put in a data cleaning program is often

%INCLUDE “c:\projectdir\sascode\” ;

2. Do not whine that you’ll do it later.

The fourth line in my program is usually PROC MEANS

At this point, I’m only looking for one thing and that is if values are out of range, for example, the items are scored on a 1 – 5 scale and the maximum of many of the variables is 8. Occasionally for perfectly good reasons other than to annoy me, people code data as 8, 99 or whatever to show that it was “not administered”, “not answered” , “did not know” and so on. There are some interesting analyses that can be done of patterns of missing data, but now is not the time to do them. If you want to use those missing value codes later, good for you, but having seen SO many times that these screwed up results, I’m going to set them to missing in my analytic file right away.
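That fourth line really can be that bare. A minimal sketch, with a made-up data set name (the strt macro above does the same thing with lib.&dsn):

```
*** Just the defaults -- scan the Maximum column for 8s, 9s and 99s on items scored 1 to 5 ;
proc means data = lib.student_int ;
run ;
```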

(A side note on that – I read in the raw data and save it as a SAS dataset. Then, I create an analytic file. So, I can always go back to the original raw data if I need it.)

The next step is to include an ARRAY statement and DO – loop

ARRAY somename{*}  beginvar -- endvar ;

In case you aren’t familiar with this …

somename is the name of the array

{*} means to set the dimension (that is, the number of elements in an array) to however many elements there happen to be.

This will put all of the variables from beginvar to endvar in the array.

DO I = 1 TO DIM(somename) ;

IF somename{i} > 5 THEN somename{i} = . ;

END ;

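Put together, the recode step might look something like this (the data set and variable names are invented for illustration; the valid range of 1 to 5 is from the example above):

```
data analytic ;
    set lib.rawdata ;
    array item{*} q1 -- q20 ;          /* hypothetical variable range */
    do i = 1 to dim(item) ;
        /* 8, 9, 99 and whatever else is above the valid range all become plain missing */
        if item{i} > 5 then item{i} = . ;
    end ;
    drop i ;
run ;
```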

3. Get rid of more useless junk

It makes a lot of sense in collecting data to ask anything you might possibly ever want to know, because it is frightfully expensive to track down the same subjects again. The result, though, is that you may end up with a whole lot of variables that you are never going to use. The first type is where there is no variance. If everyone in your dataset is in the eighth grade and from the USA then you have all of the information you are ever going to get from the “grade” and “country” variables.  Think about it, no matter what you do from here on out, if you pull out a sample of females only – you’re still going to have only American eighth-graders. No matter what kind of correlation, factor analysis or other statistical technique you do, “country” is going to explain exactly none of the variance. Make a note somewhere about the zero variance of those variables and drop them from your dataset.

There is a reason I did these steps in this order. Often, when I recode the data so all of the various missing categories get coded as missing instead of 8, 9 , 742 or whatever, the variance drops to zero or the number of records with non-missing data drops to zero.

4. Think about what you’re throwing out

Is there a good reason that those items have all missing data or the answers are all the same? It may be that the question was if the student had dropped out of school and, very reasonably, it is not asked of third-graders and you are only analyzing data for the third grade.  For obvious reasons, the American Community Survey only asks women if they gave birth in the last year.

5. Use the vacuum

In other words, use machines to help you. I automate the four steps above by combining PROC TRANSPOSE, PROC MEANS, PROC FREQ and a couple of macros.

However, that’s the topic for tomorrow’s post.



Yesterday, I wrote a post asking if we really suck at math and questioning the value of spending 15 hours a day, seven days a week, from kindergarten to tenth-grade on the path to achieve a perfect math SAT score.

Today, I read the last post by Derek Miller. It begins,

“Here it is. I’m dead ….”

If you haven’t read his blog, which, tragically, will now be an archive, I highly recommend it. He wrote publicly about his struggle with cancer and his last post was written to be published after he had died.

It was a very sad, odd juxtaposition of pieces to read – the mother who is encouraging her son to live every day for tomorrow, and the man who died yesterday.

Over eight months ago, I resigned my university position, where I’m sure I could have retired or died at my desk, after the gradual realization sunk in that we had done well enough over the last decade that I could afford to just work on what I wanted for the next few years, and maybe forever. Having been such a workaholic all of my life (you have no idea), leaving home at age 15 and supporting myself ever since, it was an unprecedented transition. Even now, when people ask me,

“How do you like working at a slower pace like that?”

I still answer,

“You know, I don’t really know how I feel about it.”

It’s probably true that, as my niece noted, my slower pace is most people’s normal life, but that’s not the point. For me, not being focused on taking EVERY contract that comes along, actually turning down work now and then, and not being in a new city so often that I have no idea what to put on websites that ask for my time zone – well, it’s been a mid-life crisis level shift.

It did occur to me today that a year ago, I probably wouldn’t have had the time to read Derek Miller’s blog, get my hair cut, read the LA Times and talk to my daughter, Ronda, all within 24 hours. Tom, the gentleman who cuts my hair has owned Ambiance Hair Salon in Santa Monica for over 25 years, and been in business for over 40 years. He has a pretty successful life. [Anyone who is on Main St., two blocks from the Pacific Ocean, didn’t make too many wrong turns in life.] We talked about the importance (or not) of perfect SAT scores. He said that he had enough of school after high school, he always wanted to be in business for himself and that’s what he’s done.

We were talking about the LA Times article and Ronda said,

“Well, she just wants her kid to do well on his tests so he can get into a good college and have a good life. I think she means well, but I wouldn’t want to be her kid.”

Is it true that if you don’t ace your SAT scores you fail in life? I thought about this as I drove down Lincoln today and saw all of the businesses along the street. I really doubt that all (or most) of the people who own the businesses had perfect SAT scores. I thought about it again as I drove back from Ronda’s house. She is the only one of my children who did not graduate from an “elite” university. In fact, she didn’t graduate from college at all. She went to the last two Olympics and now is competing in mixed martial arts. I like her house. It’s really pretty and comfortable. She shares it with a couple of very nice people and three dogs.

I concluded that one major advantage of not working at breakneck speed any more is that it gives me the opportunity to actually think about life, the past and the future, rather than just barreling through. Thinking about why I would not follow what the mother featured in the LA Times did, I remember a professor, J.T. Dillon, in graduate school who made us write what I thought were the most bizarre assignments for a course on Curriculum and Instruction. We had to answer these questions:

What is the good man, living the good life in the good society?


What must everyone learn and why must everyone learn that?

Over twenty years later, I realize the wisdom of what he was trying to teach us. His point was that we should ask WHY. Why do students need to know the answer to every math question on the SAT? Is it all about doing things we don’t like so we can make money to buy stuff we don’t need to impress people we don’t like?

I actually had a very close to perfect math SAT score, went to college at 16 and made a fair bit of money. However, I sure as hell did not spend 15 hours a day seven days a week studying and I’m sure if that mom in San Marino knew what I actually did from kindergarten through college, she’d be appalled. I studied stuff that interested me, competed in sports, went to a lot of parties, read a lot of books that had no purpose other than I felt like reading them. I played piano in duets with my brother and we pushed each other off the piano bench. I dug in the dirt, camped, hiked in the woods and generally was a kid. I skipped school, went skinny-dipping and got picked up by the police for various things the police pick you up for when you’re a teenager.

AND YET, if I had died at 19, I would have had a pretty good run. By that age, I’d graduated from college, set a few school track records, won a national judo championships and spent a year living in Japan.

If I’d died at 31, I would have been about the age of the oldest son of the woman in that article, the one she pointed out as proof her system worked because he was in graduate school at MIT. By 31, I’d worked as an engineer for several years, had three children, finished two master’s degrees, my Ph.D. and won a world judo championships. I’d also gone skinny dipping in Lake Havasu with my husband, gone hiking in the mountains, run several 10Ks just for the hell of it, trained in Switzerland, France, Japan and England. I’d have been sad to have missed my children growing up, but if I’d died at 31, I’d still know I’d had a pretty good run.

Derek Miller died at 41. Undoubtedly, he was sorry to leave behind his wife and children, but he lived a good life.

THIS is what I mean by balance, it’s not doing whatever you feel like at the moment but it’s also not putting off living until some later date. If you haven’t seen the video by Gretchen Rubin on the Years Are Short, I think it should be required viewing for everyone in the maternity ward.

Childhood, adolescence, young adulthood  – it isn’t preparation for life. It IS life.

That is one of the arguments I have against focusing on perfect SAT scores. Alex de Tocqueville said,

“As one digs deeper into the national character of the Americans, one sees that they have sought the value of everything in this world only in the answer to this single question: how much money will it bring in?”

Less than perfect math SAT scores do not preclude you from bringing in a decent amount of money – as Tom proves by his hair salon near the beach. Still, I am perfectly willing to concede that perfect math scores do increase the probability of making a lot of money, in large part because they increase the probability of getting into a “good” school which increases the chances of you getting a high-paying job when you get out. Is it worth living the first third or more of your life as nothing more than focused on making a lot of money? I would say not.

Is the difference between making the honor roll every semester (which all of my children are expected to do) and making perfect grades worth giving up sports, hiking, laying on your back looking at the clouds and sleepovers with your best friend? I would say not.

In the end, I would say, the main advantage I have gained from cutting my work hours in half is the time to ask questions about WHY.

The real irony of this is that Dr. Dillon’s course would not at all fit with our current emphasis on benchmarks or learning outcomes or marketable skills. And no one ever seems to ask, “Why not?”



“If everyone knows a thing it’s almost for sure it aint so.”

“It’s not so much the things you don’t know that hurt you as the things you know for sure that aint so.”

I don’t take anyone’s word for anything. Take those quotes, for example, each of which I’ve heard attributed to Mark Twain, Will Rogers and several others, ironically by people who were just certain they were correct.

One thing everyone knows is that Americans suck in math. We are so far behind Asia, we are continually told, that we are soon all going to be learning how to say, “Would you like fries with that?” in Chinese.

There was an article in the Los Angeles Times today that profiled a mother who had an Excel spreadsheet with a schedule for her child from 8:00 a.m. to 11 p.m. seven days a week. She said she started in kindergarten, because life is hard and students need to learn to deal with it. Her son, as a tenth grader, scored a perfect 800 on the SAT. Rather than convincing me further that we suck at math, it made me question the goal of propelling a child  to perfect scores.

One thing writing a dissertation on intelligence testing taught me is that test scores are very, very far from absolute and objective. Two critical points to keep in mind:

1. Some group of people decide what is tested, inevitably the group of people that has the most power. If we insisted that being fluent in more than one language is a factor in achievement scores, Hispanic children would be getting admitted to elite institutions in droves. Before you discard this as a silly notion, think about the arguments made for including high math scores – these are relevant to courses students take, to careers. An argument could be made for functioning in a global market place, for the ability to read texts in the original Spanish (or whatever second language a student reads).   I could write a whole dissertation on this – oh wait, I did ! – the point is we make decisions about what goes into the tests and those decisions favor some people and not others.

2. The scores we use to evaluate both at an individual and larger (school, country) level are almost never how many questions were answered correctly, which you might logically think is your test score. There you go with the logic again. Cut it out. In fact, scores depart several steps from the number of correct answers. First, there is the issue of partial credit, yes or no and if yes, for what. Second, there is the step of standardizing scores. Usually this means setting the average at some arbitrary value, say 100. If the average student gets 17 questions right, then that is set as a score of 100. The standard deviation, the average amount by which people differ from the average, is also set at an arbitrary value, say 10. (If you’re not familiar with these ideas, think of your family. We’re kind of short in my family, and if you went to a family re-union you’d probably find that the average woman is around 5’3″ give or take two inches. So, you can think of five feet three inches as being the average and two inches as being the standard deviation. If you are reading this and from a country on the metric system, 5’3″ is equal to a furlong plus a bushel, a peck and a hug around the neck.) To return to my long-forgotten point – if 84% of the people score 22 points or lower, then answering 22 questions correctly is given a score of 110 (the mean of 100 + one standard deviation of 10). The scores you see reported aren’t that closely related to the number of questions answered correctly and they tell you almost NOTHING about what precisely people do or do not know.
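The arithmetic in that example can be sketched in a few lines. The raw-score standard deviation of 5 is an assumption I made so the numbers work out; the mean of 17, the raw score of 22 and the 100/10 scale are from the example above:

```
data standardize ;
    mean_raw = 17 ;   /* average number of questions right        */
    sd_raw   = 5 ;    /* assumed raw-score standard deviation     */
    raw      = 22 ;   /* raw score at the 84th percentile (+1 SD) */
    /* re-scale so the mean is 100 and the standard deviation 10 */
    standard = 100 + 10 * ((raw - mean_raw) / sd_raw) ;
    put standard= ;   /* writes standard=110 to the log */
run ;
```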

I think most statisticians know this. I am certain that nearly everyone who does analyses of educational tests knows this. But I am equally certain that the average person reading the newspaper does not. This is important because it has to do with our sucking or not.

My assumption, based on what I read in the papers and hear on TV is that American kids just don’t know basic math. So, I downloaded the TIMSS (Trends in International Mathematics and Science Study ) data and I also downloaded the items that had been released, to see what it is that American kids do and do not know. Here are a few examples:

Students were shown a  rectangle divided into twelve squares. Five of those twelve squares were shaded. Then, they were given five choices of circles that were partly shaded and asked:

“Which circle has approximately the same area shaded as the rectangle above?”

To solve this problem you need to figure that the rectangle has 5/12 shaded and understand that 5/12 is a little less than one-half. (The figures show a circle that is 7/8 shaded, 3/4, exactly one-half, a little more than one-half and a little less than one-half.)

This question was answered correctly by 80.2% of American eighth-graders.

The next question asked :

A gardener mixes 4.45 kilograms of rye grass with 2.735 kilograms of clover seed to make a mix for sowing a lawn area. How many kilograms of the lawn mix does he now have?

This question was answered correctly by 71.9% of American eighth-graders.

I must admit that I was surprised the figure was that low, although not extremely surprised, since I know many, many adults and some young kids who never do math like this. Every phone, every computer has a calculator on it and they just think this is a useless skill, like cursive. I happen to disagree and the world’s most spoiled thirteen-year-old is not allowed to use a calculator to do or check her math homework.

Another question dealt with inequalities:

X/3 > 8 is equivalent to….

To get this answer correct, you need to understand the idea of inequality and how to solve an equation with one unknown. Essentially, you need to reason something like 24/3 = 8 so X > 24 . This, of course, presupposes you also know that 24/3 = 8.

This question was answered correctly by 42.8% of American eighth-graders.

A question that was answered by even fewer was:

What is the perimeter of a square whose area is 100 square meters?

To answer this you need to know that the area of a square is its side length squared, so each side is the square root of 100, which is 10, and that the perimeter is the sum of all four sides: 4 × 10 = 40.

This question was answered correctly by 26.5% of American eighth-graders.

One last question,

A bowl contains 36 colored beads all of the same size, some blue, some green, some red and the rest yellow. A bead is drawn from the bowl without looking. The probability that it is blue is 4/9. How many blue beads are in the bowl?

To answer it, you need to reason that the number of blue beads is 4/9 of 36, which is 16.

This question was answered correctly by 49.4% of American eighth-graders.

Are these percentages bad or good? Honestly, I thought the questions were pretty easy and I was surprised by the low percentages on some of them – but I do math for a living and I was in 8th grade almost forty years ago. So, I have known this stuff a very, very long time. I THINK some of the questions were actually what was taught in ninth or tenth grade when I was riding a brontosaurus to school, so the fact that eighth graders today don’t know this information doesn’t convince me we’re all a bunch of drooling idiots.

Here is a blasphemous question for you –  Does it matter if you know the answers in eighth grade? I’m serious. Is it worth having your child study from 8 a.m. to 11 p.m. so that he or she knows all of this in the eighth grade instead of the ninth grade?

A few weeks ago, I was looking for data for a proposal I was writing and came across a state Department of Education website with a note on its test-score pages saying that proficiency means something different under the federal government’s definition and that many people can function perfectly fine while being scored below proficient in math.

At the time I dismissed this as an excuse for poor performance. Today, when I looked at the questions and the results, I was not so sure. My two older daughters are a journalist and a history teacher. Both have degrees from good institutions (NYU and USC).  I believe neither of them could answer the question about finding the perimeter of a square with an area of 100. Perhaps they could have answered it when they took their SATs or while they were taking the one mathematics course they took as undergraduates. I’m not sure. I’m fairly certain if they ever knew this information, they’ve totally forgotten it. The truth is, as much as I hate to admit it, that neither of them at any point in their lives will feel the lack of this knowledge.

On the other hand, my daughter who knocks people down for a living (she competes professionally in mixed martial arts) could almost certainly answer these questions off the top of her head, just because she likes math and has always been good at it.

What percentage of Americans (eighth-graders or not) SHOULD be able to answer these questions?

I have no idea what the answer to that is.

Some people would say 100%,  because they need to know this information to do well on the tests to get into a good college.  I’m not sure that is true. More and more, people are asking WHY you need to do well on the tests. If I want to be a sportswriter or a history teacher or a doctor, what good does it do me to be able to calculate the perimeter of a square given the area?

I think the mother in San Marino may be part of an education bubble that will burst just like the housing bubble has. I am far from the only person to be suggesting this. Not only has the cost of higher education reached astronomical levels where it exceeds the cost of a home in most parts of the country, but it also, for selective institutions, is costing more of your life. Not only are fewer people going to be able to pay it, but, perhaps like the housing bubble, more people are going to say, “This isn’t worth it.”

I did not work from 8 a.m. to 11 p.m. I spent several hours today reviewing grants. Then I went running down to the beach, because it was a beautiful day. I had a Corona while reading the LA Times. I analyzed the TIMSS data and I watched The Daily Show. I also checked my daughter’s math homework and pointed out the one answer she had incorrect. She figured it out and fixed it on her own.

Life is not hard. Life is good.



The new structural equation modeling for JMP is pretty cool. It’s unfortunate that it requires both JMP and SAS/STAT to run, and the combined cost is so high that you pretty much have to work for a huge organization that can afford a site license for both or sell a kidney to get the money to pay for it.

If you have read this far, I am going to presume that you have sold your soul to the corporate overlords (or university overlords and don’t pretend it isn’t the same thing) and have access to both site licenses. Either that, or you have an extra kidney. Whatever.

I did write about some of the pointing and clicking a few days ago, but that post was getting long and I was running out of Chardonnay, so I left off the statistical details for another day, and I just realized – hey, it’s another day!

Completely random statistical points:

It uses PROC CALIS. This is why you have to have SAS/STAT installed.

This is actually a good thing (other than the need for a briefcase full of cash), because there is a ginormous load of detailed documentation on PROC CALIS, starting with the SAS/STAT manual. I’d never before encountered funding agencies asking for the equations used for a procedure, e.g., if you said you were doing a logistic regression using statistical package X, that was good enough. Twice this year, I have been working on proposals where they really did want that level of detail, and I recently saw another proposal that included that information, so I’m assuming they had the same experience. Either that or they just had a lot of spare time on their hands.

It seems rather a bizarre turn of events to me, but, then, maybe not. If you’re counting on these results to make policy, you ought to know how they are produced rather than treating the software as a black box. All of the statistical theory and equations behind CALIS output are available to anyone who has Internet access.

As you’d expect, you can get both standardized and unstandardized parameter estimates.

There are six estimation methods, including full information maximum likelihood. (For more about FIML, see an earlier post, “Damn! There really IS structural equation modeling for dummies!”)

There are a bunch of model fit statistics available and you can modify your user profile to only show your favorite ones each time. That JMP assumes you will HAVE favorite model fit statistics tells you all you need to know about their target market.

Saying it’s just like CALIS is pretty useless to you if you aren’t familiar with that procedure, so here are a few more bits about it:

You can do a regular path analysis; just don’t include any latent variables.
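In CALIS terms, a path model with only observed variables goes straight into the PATH statement. This is just a sketch – the data set and variable names here are made up for illustration:

```sas
proc calis data=mydata;            /* "mydata" and all variables are hypothetical */
   path
      jobsat <=== autonomy stress, /* jobsat predicted by two observed variables  */
      stress <=== workload;        /* stress is itself predicted by workload      */
run;
```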

You can do confirmatory factor analysis; just don’t have any paths among latent constructs, and set your number of latent constructs to your number of hypothesized factors. If you want an orthogonal factor analysis, set the covariance between the factors to zero.
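As a sketch of what that looks like in CALIS syntax – the data set, item names and factor names here are all hypothetical:

```sas
proc calis data=mydata method=fiml;  /* hypothetical data set; FIML estimation */
   factor
      Verbal  ===> item1-item4,     /* each factor points only at its indicators */
      Spatial ===> item5-item8;     /* no paths among the latent constructs      */
   cov
      Verbal Spatial = 0;           /* covariance fixed at zero = orthogonal factors */
run;
```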

Don’t be fooled by all the pointing and clicking and Photoshop-like options, such as increasing the canvas size so you have a larger drawing area for your model. JMP’s new SEM interface may not require coding a series of equations like CALIS or LISREL, but it is not to be confused with a tool that can be used by your average graphic designer (sorry, average graphic designer people). It still requires some knowledge of exogenous variables, endogenous variables, indicators, latent constructs, model fit statistics, standardized versus unstandardized parameter estimates and estimation methods, just for a start.

For some (but not all) of the elements required, it has defaults, so you could possibly get a model to run without really knowing what you’re doing but I really, really wouldn’t recommend it. (If you ignore my advice, do it anyway and end up looking like an ass, don’t say I didn’t warn you.)



Mr. Chips is dead – and I think I helped to kill him.

There has been a lot of backlash directed at universities over the past few decades. Some of it is very blatant – like the Wisconsin legislators who want copies of a faculty member’s email, or the Georgia legislators who attributed their university budget deficit to “overpaid professors”.

Much more of it is subtle – the continual increase in the number of professors working in part-time or non-tenure-track positions, from 43% when I was a student 30 years ago to over 70% now. Speaking of when I was in school, I remember friends commenting more than once that it was hard to imagine a career as a professor because,

“I mean, yeah, the faculty members are all brilliant and you really learn a lot in class, but are there any of them you’d want to hang out with?”

The answer was clearly no. Yet, even though I majored, or at least minored, in partying as an undergraduate, I, and all of my friends, had sincere respect for our professors. We admired their knowledge. We wanted to be in college. Most of us were attending a selective private university instead of community college in part because our parents encouraged it but in large part because – wait for it – we were honestly interested in learning, genuinely interested in the subjects we chose as a major and wanted to be somewhere we could be intellectually challenged and around other people who were also intelligent and intellectually curious. We even kind of liked the libraries and that relatively new innovation, computer labs (eventually with dumb terminals instead of key punch machines). This isn’t to say that we were paragons of virtue. Suffice it to say that I did plenty of things in college that it is best my grandmothers died without ever finding out.

And yet … college was not a job-training program to us, and we understood that as well as our professors did.

I am sure the change is a very long, drawn-out, complicated process with lots of reasons. One, I think, though, is our own fault as academics. We have become irrelevant to a lot of people. A lot of the stuff we do is just plain stupid. I’m not as cynical as Mark Tarver, in his wonderfully accurate article “Why I am not a professor”, but it is true that I, too, see many, many people who have written hundreds of articles that are just unnecessary. They remind me of the people on twitter who have 23,000 tweets. Unless you are a record-setting porn star, you cannot possibly be doing something eleven times an hour that anyone else would be remotely interested in reading about.

Research articles are like twitter from spammers, where you get the same message over and over. You’ll find that those people who have published hundreds of articles have dozens with titles like, “Factor Structure of the KABC scale for Vietnamese-American students”, “Factor Structure of the KABC scale for Chinese-American students”,  “A comparison of factor structures of Vietnamese-American and Chinese-American students” and so on and on and on.

Even worse, it’s sort of like twitter with that annoying person in high school English who always said, “utilized” instead of “used” or “I am cognizant of that fact” instead of “I know”. Not only is our writing like that, an Elements of Style in reverse (E.B. White must be rolling over in his grave), but our statistics are just as bad, if not worse.

No one inspects a correlation matrix when they could do a factor analysis using the parallel analysis criterion to determine the number of factors. Or wait, no, that is too understandable; let’s use LISREL instead. I love computers and I doubly love super-computers because I am immature and it is like a big toy, BUT a negative result has been that it is now easy to do complex survey designs, multiple imputations and maximum likelihood methods – and to condescend to everyone who does not use the CORRECT methods. We never stop to ask (not within my hearing, anyway) whether the added precision and power we are getting from these methods offsets the degree to which more and more of our research is not understandable by the average person.

I can explain the parallel analysis criterion in a few sentences so the average person can “get it”: You do the analysis on a set of random numbers and find how much each item has in common with the others. It won’t be exactly zero. You’ll find some small relationship just by chance. When you do the factor analysis on real data, any factor you find that isn’t bigger than the ones you found with the analysis of random data, you figure is just by chance. That’s why it’s called parallel analysis. You analyze your dataset and one of just random (made-up) numbers in parallel.
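The whole idea fits in a few lines of SAS, too. This is only a sketch – the sample size (500), item count (10) and data set names are hypothetical and would be matched to your real data:

```sas
/* Step 1: make a data set of pure random numbers,
   the same size and shape as the real data */
data random;
   array x{10} x1-x10;
   do i = 1 to 500;
      do j = 1 to 10;
         x{j} = rannor(0);
      end;
      output;
   end;
   drop i j;
run;

/* Step 2: factor-analyze the random data and the real data in parallel */
proc factor data=random;   /* eigenvalues you'd get just by chance */
run;

proc factor data=real;     /* eigenvalues from the real data */
run;

/* Step 3: retain only the factors whose real-data eigenvalues
   exceed the corresponding random-data eigenvalues */
```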

Try explaining item response theory or multiple imputation or proportional hazards models or structural equation models in just a few sentences. Most of those techniques really interest me. In fact, I once left a position because it was just one repeated measures ANOVA after another and I got bored with it. BUT – and there is a very huge but here – it’s not all about me; it’s not even about impressing reviewers or my colleagues, and THAT is what most of being a successful tenure track professor has become. You read academic writing and think, honestly, why do we write like that? You look at our designs and think, “Where is the statistical equivalent of Occam’s razor or The Elements of Style?” I have had people lecture me in all seriousness about how IMPORTANT it is to use a semi-colon versus a comma or to adhere exactly to the APA style manual.

And I bought into the whole thing, hook, line and sinker. Published articles, wrote grants, got a tenure track position. Damn it, I was good at this stuff.

I really liked teaching, but I knew, as everyone knows, that it isn’t the most important thing a professor does. It doesn’t even really count that much. What counts is getting articles published and getting funding for your research, so you can pay indirect cost rates to the university and get more articles published. We fooled ourselves into saying that people in business really use this stuff, that we were doing something good for humanity.

Slowly, it dawned on me – who am I fucking kidding? I’ve run a business for years, I have plenty of friends in business and do analyses for lots of social service and educational institutions and no one uses 90% of this shit that is being cranked out every year. No one uses it. No one reads it except some poor graduate student doing a dissertation or your friends’ students who had it as a class assignment and most of them didn’t read it either. Except for the editors of the Journal of Something Nobody Really Cares About it doesn’t make the least fucking difference on God’s green earth to anyone if you use a semi-colon or a comma unless it is to end a statement in a SAS program or code JCL.

So, we write this stuff to be more and more inaccessible to the average person and tell ourselves (and them) how much stupider they are than us.

What happens when you’re six years old and you tell another kid he’s stupid over and over? After a while, he comes back with,

“No, YOU’RE stupid!”

If our high school students can’t do math, what do we do? We do another nationwide study, another fourteen demonstration grants about minute issues in mathematics education, children’s understanding of the plus sign and a cross-cultural comparison of how you use the abacus versus Peruvian counting beads (which I just made up) to teach long division. Then we publish all of that using language, symbols and statistics that neither the teachers nor the principals fully understand, and certainly not the students, parents or policy makers. And we’re SHOCKED, SHOCKED I SAY, when they want to cut funding for education. [And much of the same applies to any other area of research, not just education.]

These days, whether I’m talking to middle school students or programmers or writing this blog, I try to make whatever I say understandable. People pay me for this. Well, not the blog, I just write that as a public disservice and a substitute for standing on a soapbox yelling at random passersby, like that guy on Lincoln Ave that I once saw talking to a jar of peanut butter.

It turns out that when you quit assuming that other people are too stupid to understand you, or, at the very least, that they should attend college for four to six years so they can understand you, they mostly quit thinking that the work you are doing is stupid, irrelevant or, at  the very least, not worth the time they would need to put in to get anything out of it.

The really amazing irony of all of this is that far more people read my blog, download my papers from SAS conferences, or attend presentations I give to schools or at software conferences than ever read an academic journal article I wrote or attended a paper I gave at a scientific meeting.

Many of my former colleagues shake their heads sadly, wondering where I went wrong. I was such a bright graduate student, I had post-doc offers, tenure track positions, and now here I am acting like it doesn’t really matter how many papers you publish each year using Rasch models. …. while the state cuts another $500 million out of the UC budget and the number of full-time faculty at universities drops yet again, nationwide.

I, too, wonder what went wrong, but for different reasons.


