I had a lot of questions about structural equation modeling with JMP, and when I was in Las Vegas at SAS Global Forum a while back, I was finally able to hook up with* (er, speak to) Wayne Watson and get almost all of my questions answered.

This is Wayne. You should look for him because if SEM on JMP were Pokemon, he would be Pikachu. **

What you need to use SEM on JMP

First of all, you need JMP 9.3. This is important to know because I always assume that incremental changes in versions are just a way to get stupid people to pay you more money for what are essentially cosmetic changes. In this case, you really do need 9.3.

SEM is considered part of SAS/STAT so you need either JMP and SAS installed on the same machine or SAS installed on a remote machine to which you have a connection.

SEM does not run on a Mac!

This makes me unhappy because one of the reasons I use JMP often is that it runs natively on Mac OS.

For a lot of people, the need to have both JMP and SAS will be a deal-breaker. For me, it is just a little inconvenience. It means the next time I upgrade to JMP, I will a) wait until it is 9.3 and b) make sure I get a PC version as well as a Mac version.

Cool stuff on making your model

The first nice feature is that you can create a model without any data. Why would you want to do that? For grant proposals or dissertation proposals where you want to show the model but don’t have any data yet.

It will analyze either raw data or a covariance matrix (which you would expect, but I asked anyway, because things I expect in life, such as children coming home after 1 a.m. calling to inform me of said fact, often do not happen.)

If the number of observations is not in the dataset, you can enter that number.

You can adjust the degrees of freedom and use that number instead of the N in the dataset. (I’d make sure I knew what I was doing before trying that, if I were you, and I would NOT do it just because my boss wanted higher power for the test. In this case, higher power refers not to God but to a greater probability of rejecting a false null hypothesis.)

There is a nice little drawing palette, which is kind of self-evident if you are familiar with SEM and if you’re not, uh, well why are you using this? The ovals are latent constructs, the squares are observed.

You can right click on the diagram and a drop-down menu appears to allow you to add latent variables and other model components.

Setting the variance to 1 is a menu option.

If you use AMOS, the structural equation modeling software produced by SPSS, you are familiar with the box to set variable properties. Similarly, in JMP you can use a menu or double-click on a variable to set its properties.

You can hover over a variable and it gives you the choice of a double arrow or single arrow.

You can select the observed variables from a menu or just drag and drop. If you drop the observed variables on the latent construct, JMP will create the indicators and draw the path, which is kind of nice and convenient.

It does NOT do by-group processing, so if you want analyses run for say, white, black and Latino groups you would need to save these as three different datasets. You CAN, however, save your model, so you can create the model once and apply it to the three different datasets.

You can save models in the model library either to re-use with different data or to modify as your theory becomes refined. (No, I do not think the number of cats you own is related to happiness. What the hell was I thinking? I’m going to drop that indicator and replace it with money or sex or jelly beans. Anything but cats.)

You can save just the model or you can save the whole project, including the SAS code and log.

I’ll try to write more of the statistical details later. There was a lot more cool stuff, so much so that it made me really sad that it did not run on a Mac and I don’t think there is a JMP for Linux. ***

* One of my daughters told me that “hook up with” now has an entirely different meaning from when I was young when it meant to meet with someone informally.

** My third daughter’s secret that was revealed in a couple of interviews with Inside MMA or some show like that is that when she is not competing in judo or mixed martial arts she is a complete Pokemon fanatic. Yes, that guru on the forum is not a 13-year-old boy or 27-year-old graduate student living in his parents’ basement. It’s a blonde professional athlete who was a centerfold in Vogue. (And if she’s ever a centerfold in any other kind of magazine being a professional martial artist won’t save her from me!)

*** I once asked some people from SAS why they had sold their souls to the Microsoft overlords. They mumbled something about the need to make money and only having sold eleven copies of SAS for the Mac, or something like that.



My father was born in New York City to two non-citizens who were in the U.S. for a few years, left and never returned. In his twenties, he returned to the U.S. and joined the military, I am pretty sure because it was the one thing he could think of that would most piss off his parents. A few years later, he met and married my mother and now here we all are, a whole family of “credits to our race”.


When I tell this to conservative friends, they insist, as they have all my life, that “No, no, we’re not talking about YOUR family, but those OTHER Hispanics that come over here and drop babies just so they can get welfare, WIC and benefit from the American taxpayer while we all do without services.”

One of my former teammates said,

“Those stories in the media must be coming from somewhere.”

I told her, yes, they’re coming from the same place as the stories that our president is a Muslim Kenyan whose American-born mother, for some reason, felt the need fifty years ago to place fake birth announcements in the Hawaiian newspapers so that her baby could one day be a U.S. citizen and be president. Oh, wait, children of U.S. citizens ARE citizens, no matter where they are born, like Senator John McCain, who was born in Panama to U.S. parents.

Having data available from the U.S. Census Bureau, I thought I would interrupt with actual facts. The data come from the American Community Survey, 1% sample of the U.S. To make it easier to analyze, I just downloaded the state of California. If people are coming from Mexico to have babies, I’d think this would be the closest place, rather than say, New Hampshire. So if there really are these hordes of babies being born of mothers who just came for welfare benefits, we should find them here.

The Federation for American Immigration Reform says,

An anchor baby is defined as an offspring of an illegal immigrant or other non-citizen, who under current legal interpretation becomes a United States citizen at birth. These children may instantly qualify for welfare and other state and local benefit programs.

The same site gives examples of hospitals where “over 70% of babies are born to illegal aliens” and expresses concern about “chain migration” where babies when they become 21 years old can sponsor relatives to the U.S. According to the Orwellian-named “Accuracy in Media” site the U.S. spends $6 billion a year on Mexican anchor babies.

There are a few parts of this story that need to be checked out:

  1. Huge numbers of Mexicans are having babies in California
  2. The parents are not U.S. citizens
  3. They just (as in, very recently) came here to have a baby
  4. Their reason for having the baby is so they can claim citizenship status
  5. They also want to come here so they can not work and take advantage of our welfare system

The first number is based on the question, “Did you have a baby in the last twelve months?” This question is only asked of women (duh!) who are over 14 and under 50 years of age.

Number of births to all Californians in 2009:  531,749

Next let’s look at the number of births to Hispanics in California:

Number of births to mothers of Hispanic descent in California in 2009:  256,455

Of course, most Hispanics are citizens.  If your mother is a citizen, you’ll be a U.S. citizen no matter where you are born. It’s in federal law. So, how many children were born to Hispanic mothers who are not citizens?

Number of births to mothers of Hispanic descent in California in 2009 who are not citizens: 107,811

Before you get too excited, you need to realize that if EITHER your father OR your mother is a U.S. citizen, you are a citizen. That is, if your dad’s a citizen, mom doesn’t need to make a trek from Mexico when she’s nine months pregnant. Wherever in the world you are born, you’re a citizen, which is why my oldest brother, who was born on an Air Force base in Japan, is American.  For this next step, I had to make an assumption: if you are a woman who had a baby and is living with an adult male, he is the father.

What I want to establish is whether the mother or father is a citizen. You could argue that adult male could be HER father. Well, in that case, the mother would be a citizen. If your father (or mother) is a citizen when you are born or  becomes a naturalized citizen before you turn 18 years old, bingo, you’re a citizen.

Number of births in households to mothers of Hispanic descent where neither parent is a citizen in California in 2009:  71,343
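If you want to replicate these counts, the logic is only a few lines of SAS. This is just a sketch: the dataset name is made up, and the ACS PUMS variable names I am using (FER for a birth in the last twelve months, HISP for Hispanic origin, CIT for citizenship status, PWGTP for the person weight) should be verified against the codebook for the year you download.

```sas
/* Weighted counts of births in the last 12 months, broken out by     */
/* Hispanic origin and citizenship. The dataset name and variable     */
/* names are assumptions -- check your ACS PUMS codebook.             */
proc freq data = in.ca_pums ;
   where fer = 1 ;               /* woman gave birth in last 12 months */
   tables hisp * cit / missing ;
   weight pwgtp ;                /* person-level sampling weight */
run ;
```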

Lie #1: A huge proportion of births are to non-citizens

First of all, let me point out that the proper term for these children, born in the United States, is “U.S. citizens.”

Now, California has approximately 28% of the Hispanic population of the country (13.7 million out of the 48.4 million). If the birthrate here is proportional, that would mean that the entire country would have 1/.28 * 71,343 or

Estimated births in 2009 to Hispanic non-citizens = 254,786

The total number of births in 2009 was 4,136,000

Births of children to Hispanic non-citizen parents = 6.2% of births

While the 70% figures given by the inflammatory websites cited above give the impression that a large percentage of babies born in the U.S. are to non-citizens, I guess those were not intended to be taken as factual statements.

Lie #2: Children of non-citizens cost a large percentage of the budget

Now, California has approximately 28% of the Hispanic population of the country (13.7 million out of the 48.4 million), and 66% of Hispanics are Mexican. Multiplying out these figures, we get 168,159 births to Mexican non-citizen parents.

Dividing $6 billion by 168,159, I get $35,680 for each child. Where exactly is that $35,680 coming from? It’s not welfare. The average monthly payment for a single mother with one child in the state of Texas is $225. Look it up. In California, a family of three (a mom and two children) receives $750 a month. Nationwide, the average is $425 a month.

So, if EVERY child were on welfare, the cost would be $71,467,575 a month, which, according to CBS News, is about what the federal government spends in 11 minutes.

Lie #3 : Pregnant women are coming here to have babies just so they can stay in the United States

I should point out that one fact from the hate speech websites did check out: the average non-citizen woman giving birth already has a child over age five who is a U.S. citizen. In fact, of those 71,343 mothers who gave birth, the majority already had a child over five who was a U.S. citizen, which presumably means they have lived here over five years.

Number of Hispanic mothers giving birth in families in which this child was the first citizen in the household: 33,915

Incidentally, that represents 6.4% of California births, and that includes all Hispanics, not just Mexicans. The only reason I used only Mexicans in the previous analysis is that the claim of $6 billion spent per year referred specifically to children born of Mexican parents.

I have not yet gotten to the part about the GREAT majority of households in which the parents were working. One reason for that is that there was a lot of missing data on the employment variables I wanted to use.

I could go on more but I have to get back and do some work for actual pay.


I compared the percentage of Hispanics in the population estimates from the American Community Survey and the total population of the state of California to the U.S. Census Bureau’s facts on California and came up with the identical numbers, 37.0% and 36,961,664.

The number of births of 531,749 compares well to the 526,774 reported by the California Department of Finance demographic report. Since the two don’t cover the exact same time period, we would not expect identical numbers, but the estimate from the American Community Survey is only off by 0.9%, and it is actually higher, so if there is an invasion of anchor babies, the error is in the direction that would support their hypothesis.

I plan on doing more analyses to check more on this topic. Please feel free to post any suggestions. If anyone wants to take a look at my SAS code, it’s really very simple, I can email it to you or post it here. If you find any errors, I would be happy to have them pointed out.



I was disappointed to see that the Open Data community is pretty inactive over at data.gov. With 305,000 datasets and counting released you’d think there’d be more than a handful of people posting over there.  I decided I would start on my own with the TIMSS data. This is the Trends in International Mathematics and Science Survey. Props to them for releasing their data along with their programs, codebooks and publications. It takes some nerve to open up your work to the public and let other people have the freedom to scrutinize it and possibly criticize it.

First, I downloaded three folders containing 20 files – six text data files, six codebooks and eight SAS programs and other documentation.

No one asked my opinion on this – but that’s never stopped me before. Presumably one benefit the government is hoping to obtain by releasing its data is to get feedback – crowd-sourcing, differing perspectives.

Here are a few observations on the TIMSS data. This is not to say there was anything wrong with the way the analyses were done, but simply that different people (me, for example) might have different interests.

In their analyses, they seem to be interested in whether a question was answered incorrectly or just not answered, either because the student skipped that item and went on to others, or because he or she didn’t make it that far in the test, presumably because time ran out. Personally, I am interested in whether the student got it right or wrong. I’m assuming that if the student skipped it, the reason was probably that she didn’t know the answer. Deleting the formats that specified whether the student answered, omitted or skipped an item saved me thousands of lines.

I created an array of all variables that had a 998 or 999 and changed those to missing.
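For what it’s worth, that recoding looked something like this. This is a sketch: the variable range is the one from my dataset, and you should confirm that 998 and 999 are the only missing-value codes in yours.

```sas
/* Recode the not-administered / not-reached codes (998, 999) to missing */
data recoded ;
   set in.g8_achieve07 ;
   array rec{*} M022043 -- S042164 ;   /* all items, in dataset order */
   do i = 1 to dim(rec) ;
      if rec{i} in (998, 999) then rec{i} = . ;
   end ;
   drop i ;
run ;
```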

Originally, for some reason I could not open the file and view it in SAS. I was wondering if the data viewer had a limit to the number of variables, but now I am just thinking at the time I had too many applications running at once and there wasn’t enough memory available, because now it opens up fine.

My first ARRAY for re-coding included all of the variables from the first mathematics item to the end, BSREA05.

I realized when I started running the means to check everything that this isn’t what I wanted because that ends up changing all of the subscale scores, gender and age to 0 or 1. I only want the actual test questions coded that way. So, I changed it to:

array rec{*}  M022043 -- S042164 ;

Yes, this recodes all of the science items, too, which I ended up dropping, but the math and science items were interspersed and I sure as hell wasn’t going to type in 200+ variable names. My typing skills aren’t that good.

In fact, because I am a total lazy slacker, I did this:
* Get the variable names: run PROC MEANS, transpose the output so each ;
* variable becomes a row, then print the sorted names ;
proc means data = in.g8_achieve07 n ;
output out = sam ;
run ;
proc transpose data = sam out = data1 ;
id _stat_ ;
run ;
proc sort data = data1 ;
by _name_ ;
run ;
proc print data = data1 noobs ;
var _name_ ;
run ;

Then, I copied the variable names beginning with S from the output, pasted it under the word DROP and had my drop statement.  Yes, I will do almost anything to avoid typing. My mother told me that my life’s goal should be NOT to become a secretary (she was a secretary for 25 years so I guess she knows whereof she speaks).
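If you are even lazier than I am, SAS will build the DROP list for you from DICTIONARY.COLUMNS. A sketch, using the library and dataset names from my program; note that this drops everything starting with S, so make sure no variables you want to keep share that prefix.

```sas
/* Build a macro variable holding every variable name that starts with S */
proc sql noprint ;
   select name into :droplist separated by ' '
   from dictionary.columns
   where libname = 'IN' and memname = 'G8_ACHIEVE07'
     and upcase(name) like 'S%' ;
quit ;

data mathonly ;
   set in.g8_achieve07 (drop = &droplist) ;
run ;
```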

At first, I just coded everything either incorrect or correct, ignoring the partial responses. Then, I got to wondering whether that would make any difference. So, I re-ran the scoring program giving half a point for a partial response and one point for a full response. This is not identical to the TIMSS scoring, which gives 2 points for a correct response and 1 for a partial response. I guess the rationale is that a partial response to a really difficult question should be equivalent to a completely correct response to a much easier question, and that does make some sense.

The reason I did it with the 1 point, .5 point scheme is that it took me about one second to modify my program. You can see at right that whether I coded partial credit or not made very little difference overall, but for some items it was significant. Of course, for the items that didn’t have partial credit, it made absolutely no difference. The overall mean of item difficulty changed almost none; it was around .50 either way, which is actually optimal item difficulty for a test.
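The one-second modification really was just an extra line in the scoring step. A sketch, with a hypothetical item coding where 2 means correct and 1 means partial (check the formats that come with your own data):

```sas
/* Full credit = 1, partial credit = .5, anything else = 0         */
/* The raw codes 2 (correct) and 1 (partial) are hypothetical here */
if raw_item = 2 then score = 1 ;
else if raw_item = 1 then score = 0.5 ;
else score = 0 ;
```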

Note that neither of the ways I scored the data were the way TIMSS scored it. All multiple choice items were one point in their method, SOME of the problems that required a written solution awarded 2 points with possible 1 point partial response credit. Others did not award partial credit. It’s not difficult to use the formats supplied with the TIMSS data to create a scoring program, but it will take longer than the one second it took me previously.

One reason I am fiddling with the scoring is that I want to see how robust these results are. People always say you can prove anything with statistics. Yeah, in the same way you can prove anything with an apple pie.

You can say, “If you listen very closely, this apple pie says President Obama is an alien” and some people will be stupid enough to believe you or simply very, very ill-informed. (Psst – apple pies can’t talk. Now you know.)

Before I go ahead and do any analyses by group I want to know if there are any global decisions that make a difference, like awarding partial credit or not. Everything I am doing so far just entails getting to know the data better – how it was coded, how it’s distributed, how different scoring criteria might make a difference. I was interested in this because right or wrong is a completely objective fact in mathematics – the area of the rectangle is 32 or it isn’t – but the decision to award partial credit or not is just that, a decision.

For now, though, I have to get back to work that pays actual money. Since it was Easter, I went to mass, of course, then out to eat with my lovely children, then to Universal Studios, which resulted in me getting back, doing  work for actual money until 3 a.m. and then updating my blog.

Therein lies the drawback of much of the analysis of open data, in that it relies on the goodwill of people like me to conduct and document their analyses for free – goodwill which is limited by the desire to occasionally see my children, the importance to me of being at mass on Easter, and the need to have enough cash for the cost of annual passes to Universal Studios.

Oh, and Happy Easter!



As always, I learned a lot at SAS Global Forum this year. It is one of those conferences where there are always two or three sessions going on at once that I would like to attend. I stayed for a bit AFTER the conference because I knew that I would not want to waste any time on sight-seeing while the conference was going on.

On the way home, I went hiking out in the desert. There are wire fences that are meant to keep you from wandering out into the desert, but the wires are close enough that only a really small person could slip through. Apparently the state of Nevada assumes that if you are that small, there is a responsible adult around somewhere to tell you that it is a bad idea to go slipping through wire fences, climbing over hills and roaming through the desert where our ancestors died of thirst (or somebody’s ancestors, mine were chilling in Venezuela on the Caribbean coast).  Since there wasn’t a responsible adult in sight,  I did it anyway and it was very nice.

Having foregone hiking, gambling, drinking and other Nevada attractions during the conference to attend every presentation I possibly could, I was surprised when the inimitable Joe Perry told me that he didn’t come to SAS Global Forum every year for the sessions, neither to attend them nor to present; he came for the networking.

It’s true that I did learn more about the NHANES data from talking to Patricia Berglund, but I only talked to her because I had attended her session on survival analysis and wanted to ask a question afterwards. I didn’t talk to Dr. Yung at all, but I learned a lot in his session on structural equation modeling. The same goes for the sessions on forecasting, the Belgian health care database, hash tables and cluster analysis that I attended. (Okay, well I didn’t learn a lot I didn’t already know about cluster analysis, but I learned a little bit, which made it worthwhile.) I still have drafts of three more posts about what I learned saved on my iPad, and two of those are about sessions I attended.

On the other hand, I learned a whole, whole lot about the new JMP interface for structural equation modeling by corralling Wayne Watson for half an hour. Last year, I met with an incredibly helpful person from SAS (I am 90% sure his name was Robbie) who showed me a lot about high performance computing.

I have exactly once gotten a consulting job from someone who saw me present at a conference (it was WUSS, actually). I think the company was giving out bonuses for referrals at the time, and when they asked if he could recommend anyone, he said, “There was this woman I saw …”.   The profit from that contract alone paid for every SAS conference I will attend in my life, and just about every conference of any type. But it only happened once, that was a decade ago, and I didn’t actually meet the gentleman officially until a year or so later.

And back to Joe Perry who provided a very useful insight. We were talking about a presentation he gave at WUSS a few years ago on managing software projects. I told him that I did not do a lot of the software architecture design that he emphasizes because much of what I do is “throw away programming”, not that it is so bad that it is garbage, but rather one-off projects, where I am doing an analysis for the final report of some program, for example. “Final” meaning the project is over. Joe suggested that perhaps if I spent more time designing and documenting those one-off projects I could re-use parts of them. The more I thought about it, the more I thought he was right. There are reasons I have not done it that way in the past, but the truth is, they aren’t very good reasons.  So, I am going to try changing the work allocation a bit on my current projects and see how it goes. If it really does make us even 5-10% more productive, then, from a financial point of view, that would make it the most worthwhile thing that ever came out of a conference.

But then … there’s the fact that I never would have met Joe in the first place if I hadn’t attended his presentation …

In fact, I’m going to go to LABSug this summer, to WUSS in the fall (where I’ll be presenting SAS Essentials) and to SGF next year. I’m also heading out to the Consortium of Administrators of Native American Rehabilitation meeting in Green Bay, where I’ll be a presenter, and to the SANDS meeting in San Diego, where I’ll also be presenting on using SAS for statistics in some very innovative and fun ways.

I’ve come to the conclusion that conferences for me are like the public libraries (which I love – if you are ever in downtown L.A., the central library, pictured at left, is worth a visit).

You go there to find the information you are seeking, but if you are sensible, you take an hour or so extra to wander around to learn about things you never knew existed.

So,  I think I’ll keep going to the sessions for a while yet.



True confessions: I just don’t get the data hackathon

I am old. If I was not aware of this by looking at my U.S. birth certificate (which, like President Obama, I do have), I have America’s most spoiled thirteen-year-old to tell me,

“You’re old.”

(sometimes followed by – “and stupid” if I have just told her that no, I will not buy her an iPhone 4 because her current iPhone works just fine, or no, I will not buy her a $26 lip gloss at Sephora because THAT is, in fact, stupid, and no one who is thirteen and beautiful needs to wear cosmetics anyway. )

So, it has been established by four daughters in succession becoming thirteen, that I am old and out of it.

Perhaps this explains why the data hackathon and similar concepts mystify me. I just don’t see what of any use can be accomplished by people with no familiarity with the data throwing themselves at it for 24 or 48 hours without sleep.

  1. Have none of you read all of the research on how sleep deprivation has a negative effect on cognitive performance? Do you think for some reason it doesn’t apply to you? Seriously, the ability to go without sleep is not a super-power. It’s more often a sign you’re using coke, which as we used to say in college, is God’s way of telling you that you’re making too much money.
  2. At a rock bottom minimum, to understand any dataset you need to know how it was coded. What is variable M03216, exactly? If that is the question, “What is 3 cubed?”, then what do the answers A, B, C, D, 97, 98 and 99 mean? Which of A through D is 27, the correct answer, and what the hell are 97 through 99? These are usually some type of missing value indicator, possibly showing whether the student was not administered that particular item, didn’t get that far in the test or just didn’t answer it.
  3. Once you’ve figured out how the data are coded, you need to make some decisions about what to do with them AT THE ITEM LEVEL. What I mean by that is, for example, if an item wasn’t administered, I count the data as missing, because there was no way the student could answer it. If the student left it blank, I count it as wrong because I’m assuming, on a test, if you knew the right answer, you would have given it. What about if the student didn’t make it that far in the test? I count it as wrong then, also, but I do realize I am less certain in that case. My point is, these are decisions that need to be made for each item. You can do it in a blanket way, as in, I’m doing this for all items, or you can do it on a case by case basis. Whether you do it knowingly or not, something is going to happen with those items.
  4. Often, sampling and weighting are issues. If you are doing anything more than just sample descriptive statistics (which aren’t all that useful in and of themselves) you need to know something about how the data are sampled and weighted. Much of the government data available uses stratified sampling, and often certain strata are sampled disproportionately. AT A MINIMUM , you need to read some documentation to find out how the sampling is done. If you don’t, maybe you’ll be lucky and the actual sampling was random, which is the default assumption for every procedure I can imagine. Listen carefully – Hoping to get lucky is a very poor basis for an analysis plan.
  5. Data quality needs to be checked and data cleaning done. Unless you are a complete novice or a complete moron, you are going to start by doing things like checking for out-of-range values, reasonable means and standard deviations, the expected distribution, just to make sure that each variable you are going to be using in your analysis behaves the way it should. If there are no diseases recorded on Sundays in some zip codes, I doubt it’s because people don’t get sick those days but rather because the labs are closed. This may make a difference in the incidence of a disease if people who get sick on weekends go to another area to be treated. You need to understand your data. It’s that simple.
  6. Somewhere in there is the knowledge of analytic procedures part. For example, I wanted to do a factor analysis (just trust me, I had a good reason). Well, because of the way this particular dataset was put together, no one had complete data, so I couldn’t do PROC FACTOR. I thought possibly that I could use PROC CALIS with method = FIML – that is the full information maximum likelihood method. I went to the SAS documentation and flipped through the first several pages, glanced at some model by Bentler illustrating indicators and latent constructs, said to myself, “Blah blah, I know what that is”, skipped over the stuff on various types of covariance matrices, etc. I can guarantee you that the first time I looked at this (which was a long time ago), I did not flip through it. I probably spent 15 minutes looking at just that model making sure that I knew what F1 was, what v1, v2, etc. were and so on. That is one advantage of being old, you’ve seen a lot of this stuff and puzzled it out before.
  7. You need to actually have the software required. Although there is supposedly a version of SAS out there that allows the use of FIML for PROC CALIS none of the clients I have at the moment have the latest version of SAS, so I don’t have access to it. I believe FIML is only available in SAS 9.22.  You might think having the software available would be a benefit of a hackathon but the ones I have seen advertised all say to bring your own laptop, so I am assuming not.
  8. All of this is before knowledge of the actual content area, be it contributions to charity, earthquake severity, mathematics achievement or whatever, I assume there has been research done previously on the topic and it might be helpful to have that knowledge. I will guess that the organizers of the hackathon are assuming whoever attends will already have #5, 6 and 7, here. This is again a bit puzzling to me because the combination of that knowledge took me years of experience and I don’t think I am particularly slow-witted. It’s like all the job announcements you see that want someone twenty-five years old with at least ten years of experience. The hackathons seemed to be aimed at young people. (The advertised free coffee and all the fast food you can eat is not really much an enticement for most people in their fifties. If you can’t afford to buy your own coffee and fast food by that age, I think you took a wrong turn somewhere and perhaps you should hang up that DATA ROCK STAR t-shirt, buy a suit and go get a job.)
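To make the sampling and weighting point concrete: SAS has survey procedures that take the design into account, so there is no excuse for ignoring it. A minimal sketch; the stratum and weight variable names here are placeholders for whatever your codebook calls them.

```sas
/* Descriptive statistics that respect a stratified, weighted design */
proc surveymeans data = mydata ;
   strata stratum_id ;     /* placeholder: your stratification variable */
   weight samp_weight ;    /* placeholder: your sampling weight */
   var math_score ;
run ;
```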

So, I don’t get it. I don’t get what you are hoping to pull together in 48 hours, with little sleep, little time, and often limited knowledge of the data. PERHAPS the idea is that if enough people try enough different random things, we will just get lucky??

Luck is not a research strategy.

Oh, and there’s this little thing called Type I error.



Anyone who tells you they know all of SAS is like that creepy guy at the fraternity party who swears that his father is the Duke of Canada, that is, they have a perception of themselves that is not in screaming distance of contact with reality.

Even though I have been using SAS for decades, I am still discovering new functions, tricks, tips and procedures all the time. In fact, the one I came across today was so helpful, I even cross-posted this on SAScommunity.org.

My problem and how I solved it

There are currently over 370,000 datasets on the data.gov site, not to mention the numerous others available from the National Center for Education Statistics and many other open data sources. One problem users of these data often encounter is that the formats used by the creator are not necessarily those desired by the end user. For example, many files have user-defined formats for each individual item such as:

proc format ;
  value S9FMT
    1 = "A"
    2 = "B"
    3 = "C"
    4 = "D*"
    8 = "NOT ADMIN." ;
  value S10FMT
    1 = "A"
    2 = "B"
    3 = "C"
    4 = "D"
    5 = "E*"
    8 = "NOT ADMIN." ;
run ;

A common desire would be to have the items that were not administered to the student have a missing value and the rest scored as either correct or incorrect. At first thought, using the PUT function to get the formatted value might seem like a good idea, but that would require specifying the format. Since many of these datasets include several hundred variables and several hundred different formats, that’s not going to work.

Here is one solution, using the 2007 dataset for eighth-grade students from the Trends in International Mathematics and Science Study (TIMSS):

%include "C:\Users\me\Documents\TIMSS\Tformats.sas" ;

data scored ;
  set in.G8_ACHIEVE07 ;
  length FVAL $ 9 ;
  array rec{*} M022043 -- BSSREA05 ;
  do i = 1 to dim(rec) ;
    FVAL = vvalue(rec{i}) ;
    if FVAL = "NOT ADMIN" then rec{i} = . ;
    else if index(FVAL, "*") > 0 or FVAL = "CORRECT R" then rec{i} = 1 ;
    else rec{i} = 0 ;
  end ;
  drop i FVAL ;
run ;

The %INCLUDE statement includes the formats defined by the organization that created the dataset.
(TIMSS, like many of the original data sources, includes a folder of SAS files along with the data downloaded that read in text data, create formats and labels, merge files and perform other useful functions. )

The VVALUE function returns the formatted value of the variable.
In the program above, it is first necessary to recode the items that were not administered to a missing value, and then score the students who were administered an item as correct or incorrect. In this particular dataset, the formatted values for correct responses either had an “*” next to the correct multiple-choice value or the words “CORRECT RESPONSE”. (Since FVAL is only nine characters long, the formatted values get truncated, which is why the code compares against “NOT ADMIN” and “CORRECT R” rather than the full labels.) Of course, these statements would need to be modified depending on the formatted values of your particular dataset.



I’ve never understood masochists. Back in the days when I was competing I would regularly get calls from creepy men who were willing to pay me big bucks to beat them up. As one of my lovely daughters said of the sumo wrestler who sent her a picture of himself posing in his diaper-thingie and asking her for a date,

In a word – eeew!

I am not a masochist. Nor a witch, for that matter, in case anyone is interested. And yet, I find myself spending not hours but days poring over codebooks and technical manuals to understand open datasets. The latest is the TIMSS (Trends in International Mathematics and Science Study) dataset.

There does not seem to be any substantial resource for analysis of open data, or collaboration by people doing it. I noticed that donorschoose.org has a competition going on, and that is certainly a worthwhile cause. It focuses on their own data, although they do suggest merging with other possible datasets. There are also the community forums on the data.gov website, which get surprisingly little traffic given that over 300,000 raw datasets are available. Since I could not find anywhere else to post this information, I am putting it here, largely for myself for later use, but also for anyone else who might be working with TIMSS or similar data and find it useful. If you do know of resources on analysis of open data, PLEASE post the information here!

Sampling – from the TIMSS technical report and user guide

I don’t believe anything anyone says unless I can prove it myself. My initial suspicion was that perhaps the country comparisons were not equivalent; that is, it may be that we had a very representative sample of students in the U.S. while other countries had more selective samples. For a HYPOTHETICAL example, if you had a country where nearly 100% of students get an education at least through the eighth grade (as in the U.S.) and you compared it to a country where the dropout rate before eighth grade was 50%, then you might find that the U.S. performed more poorly when that is not necessarily the case at all.

I still wonder if that might be true, but what I have concluded so far is that the TIMSS sample seems to be a pretty fair, representative sample of the U.S. Students from high poverty and central city schools are, in fact, slightly underrepresented, but the difference from the population is really so slight that it is not worth mentioning, even though I did just mention it.

What did I expect? Well, I hoped that this might be the case, but one reads everything about education in the media, from the average teacher in Wisconsin making $100,000 a year (AS IF!) to claims that the high school dropout problem is due to illegal aliens (a slight effect if you remove non-citizens from the data, but very small compared to other factors). The available data on sampling are extremely detailed and probably more complex than really necessary, in my opinion. However, my opinion is primarily based on the use I intend for these data, which is not the same as the main goal of TIMSS.

Personally, I’m very interested in what exactly American eighth-graders know, at a micro level; that is, which questions did they get right and which did they get wrong? One of the benefits of making your data open to everyone is that it can be applied to answer questions that were not part of your original study. It opens up the possibility of getting a great deal more in the way of analyses than your research team can do on their own.

Of course, it also means that others can scrutinize every bit of your research design and hold it up for criticism. So, kudos to the TIMSS team for taking the plunge!

Administration and instrument

TIMSS documentation states that the test is designed to measure five areas of mathematics and three levels of difficulty. There are fourteen versions of the test, and students receive one at random. The TIMSS analysts then impute plausible values for the items a student was not administered from the ones that were. I can speculate on some very good reasons for doing it that way; in particular, if you have one version of a very high-stakes test, it won’t be that hard for people to get hold of that test and teach the answers to the exact questions on it. Having 14 different versions certainly makes cheating much harder. Regardless, for my purposes I would have preferred that everyone had been given the same test, because I’d like a very large N. When I look at the item frequencies, each item says “NOT ADMINISTERED” for about 86% of the subjects, which means each item was administered to about 14% of students. If each item appeared on only one of the 14 forms, that figure would be closer to 7% (1/14), so, obviously, each item must show up on about two of the forms; the 14 parallel versions aren’t completely different.

The good news so far is that TIMSS documentation and data have had almost everything I have wanted.

The bad news is that there is an almost overwhelming amount of information. The technical report and user guide is over 300 pages long. The codebook for eighth-grade mathematics achievement is 178 pages long, and the eighth-grade achievement dataset alone has 572 variables. I’m not particularly interested in science at this point, so I can drop all of those. On the other hand, there are no data on ethnicity or income in this dataset, so it is clear I am going to have to merge these data with the school and student files at some point.

The programs include user-defined formats, so you can get format errors if you run them as is. You have three options: one is to just delete their FORMAT statements; a second is to create the formats, either temporarily, by copying them at the top of your program (or, more sensibly, using a %INCLUDE statement), or by making them permanent formats; and the third is to use OPTIONS NOFMTERR when you don’t feel like messing with the formats at all. Usually I would not add formats permanently, but since I can already sense that I am going to be using these data a lot, I’m going to go ahead and do that this time.
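For concreteness, here is a minimal sketch of those options (the formats folder path and the LIBRARY libref are placeholders of mine, not from the TIMSS documentation):

```sas
* Option 2, temporary: include the TIMSS-supplied format definitions each run ;
%include "C:\Users\me\Documents\TIMSS\Tformats.sas" ;

* Option 2, permanent: point the LIBRARY libref at a folder, then run the ;
* supplied format program once with LIBRARY= added to its PROC FORMAT statement ;
libname library "C:\Users\me\Documents\TIMSS\formats" ;

* Option 3: tell SAS not to stop over missing formats ;
options nofmterr ;
```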

SURPRISE  – an age of 13 doesn’t mean 13 years old !

It’s going to be a lot of work, but worth it. I’ve already seen some very interesting statistics. I was astounded to see that only 1% of the students were under 14 at the time of testing. In fact, there were twice as many students who were 16 years old in the eighth grade as 13. Even if the testing is at the end of the year, that seems really low.

Just a little tip on age: the ages appear to be stored as decimals even though they print as integers. So, if you have a statement that says something like

if bsdage in (14, 15) then ... ;

you will find very few students. In fact, you’ll get only the students whose stored age is exactly 14.0 or 15.0.

My recommendation is NOT to use the age format TIMSS supplies when you read in the data, since it seems to round students to the nearest age. I don’t think most people think of age like that after they’re five years old; they quit saying they are “almost six” and give you their real age.

You can then cut the data however you like. I used a cut-off of under 13.5 years as “young” for the eighth grade and over 15.5 years as old.
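In SAS, that cut is just a few lines; a sketch, assuming bsdage was read in without the rounding format (the group labels are my own):

```sas
data ages ;
  set in.G8_ACHIEVE07 ;
  length agegrp $ 7 ;
  if bsdage = . then agegrp = ' ' ;             /* missing age stays missing */
  else if bsdage < 13.5 then agegrp = 'young' ; /* young for the eighth grade */
  else if bsdage > 15.5 then agegrp = 'old' ;   /* old for the eighth grade */
  else agegrp = 'typical' ;
run ;
```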

Since my mother let three of her children skip grades in school against the advice of administrators, with varying results, I have always been fascinated by the experiences of kids who are young for their grade. Personally, I think it was the best thing my mom ever did for me, and I will be forever grateful that she told the principal to stick it (actually, my mom would never say anything like that, and for her to buck authority, especially a nun, was extraordinarily out of character). I let my oldest daughter start kindergarten at four and go off to college at 17. My brother, on the other hand, thought it was a great hardship and said he would never let his kids skip a year of school.

As I suspected, those who were young for their grade were disproportionately female (64%) while those who were old for their grade were disproportionately male (67%).  Whether this represents anything more than people accepting a stereotype that boys mature late, I don’t know.

The boys who were younger than average seemed to be doing quite well. Male or female, children who were young for their grade did best; children who were old for their grade did worst. My guess would be that children who were advanced were promoted and those who did poorly were held back, certainly not an earth-shattering revelation. What it does seem to suggest at first glance, though, is that for both males and females, being a grade ahead of children their own age isn’t related to academic problems in the eighth grade.

It’s 1 a.m. and I should probably be thinking about sleeping but I am just starting to get into the fun part of the data.

So, what have I learned from open data? In a nutshell, it’s a mountain of work to get started with a dataset of any complexity, but like most times in life, the work can pay off, sometimes in unexpected ways.



One of my favorite movie lines ever comes from The Princess Bride,

“You keep using that word. I do not think it means what you think it means.”

Sometimes I want to say that to people who want me to give an explanation for every result that is “significant”.

Perhaps you would like to test for sphericity. There is a nice (and pretty!) definition of sphericity and compound symmetry given on the graphpad.com site. Compound symmetry and sphericity are often discussed as if they are the exact same thing. They are not exactly the same, but the two are related. In case you were dying to know, compound symmetry means the variances AND the covariances are equal.

Sphericity means that the variances of the differences between all pairs of measures are the same. Let’s say you test people once a year for three years. Then the variance of the differences between tests given one year apart and the variance of the differences between tests given two years apart would be the same if you met this assumption.
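You can eyeball this assumption directly; a sketch, with year1 through year3 standing in for the three hypothetical test scores:

```sas
data diffs ;
  set mydata ;
  d12 = year1 - year2 ; /* one year apart */
  d23 = year2 - year3 ; /* one year apart */
  d13 = year1 - year3 ; /* two years apart */
run ;

* Under sphericity, these three variances should be roughly equal ;
proc means data = diffs var ;
  var d12 d23 d13 ;
run ;
```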

Any self-respecting statistics package will include a test for sphericity when doing a MANOVA. In my case, I am doing a repeated measures MANOVA, and I strongly suspected that the test for sphericity would be non-significant, having engaged in the highly sophisticated statistical technique of looking at my correlation matrix and noting that the correlations among the variables were almost all the same. This very rarely happens in real life. I also looked at the covariances, and these were pretty much identical as well.

You can get the covariances in SAS by doing this:

proc corr data = mydata cov ;
  var varname1 - varname6 ;
run ;

So, how do I get my test of sphericity?

In SAS, I do this:

proc glm data = mydata ;
  class site ;
  model mon tues weds thurs = site ;
  repeated quiz 4 / printe ;
  manova h = _all_ ;
run ;

and guess what? My Mauchly’s test of sphericity was significant. What the hell?

One problem, which should have been obvious if I had thought of it, is that I have a huge sample size for this study. Mauchly’s test is sensitive to sample size, which means, among other things, that it will be significant with large samples even when you have very minor departures from sphericity.

So,  I did what any self-respecting statistician does when confronted with a trivially small meaningless but statistically significant result. I ignored it and went on with my work.

We find that obvious (I hope) when talking about correlations of .03 or mean differences that amount to .4 points on a scale of 0 to 100. Yet, when it is a test with which a person is not very familiar, all of a sudden a significant result is scary.

“OH MY GOD! We have violated the sphericity assumption! Whatever shall we do? Release the flying monkeys.”



all because they’re not really sure what an important departure from sphericity looks like, or even means. This applies to any test, from the t-test (for some people) to differences in nested-model chi-square values.

Significant is not a synonym for important. Not ever.

P.S. (For a nice explanation of compound symmetry and sphericity, check out A bluffer’s guide to sphericity, by Andy Field. In fact, you should check out the whole Statistics Hell site. It’s pretty funny.)



Since the whole presentation Patricia Berglund gave on survival analysis is available at the SAS Global Forum takeout section (which I explained yesterday; you should have been paying attention), I just wanted to add a few highlights here.

1. Using PROC LIFETEST with a STRATA statement is a very dandy way to show survival curves for different ethnic groups on the same chart, thus visually checking whether the groups have different probabilities of depression over the lifespan. Of course, you can use this same method for any variable and any group, say, college graduation and social class.

2. She compared PROC SURVEYLOGISTIC and PROC SURVEYPHREG and found highly similar results. That was really interesting because it showed that, in her example, treating survival as continuous, as in the Cox model, or as dichotomous, as in the logistic model, did not produce much difference.

3. The data she used came from the National Comorbidity Survey, which was interesting to me, since I am very interested in open and semi-open data. (This is a term I just made up for data that is distributed beyond the original institution, for example, through the Inter-university Consortium for Political and Social Research.)

One value of this conference, and in fact any conference, over just listening to the sessions on the Internet is the ability to ask questions of the presenters, or just get into a conversation with them. Dr. Berglund was incredibly patient with the long line of people who wanted to speak with her after her presentation; there were so many that she ended up leaving the room to let the next presenter speak and then answering questions for another 20 or 30 minutes. Somehow we got off topic (yes, I know it’s hard to imagine that happening with me), talking about open data, and she told me about the NHANES (National Health and Nutrition Examination Survey) data, which I did know about. What I didn’t know was that they have tutorials for using their data and, in general, better documentation that makes it easier to download and use. As a result, I’ll probably use this for my next middle school statistics presentation. I am happy.

Speaking of the value of conferences: for several years, I have made SAS Global Forum one of the conferences I attend each year. One reason for selecting SGF over the Joint Statistical Meetings, the American Educational Research Association or some others is that it matches my personal interests better. I usually attend five or six events a year; I realize that sounds luxurious to some people, but still, my time and budget are not unlimited.

One reason I come to SGF instead of JSM or AERA is that SGF provides a more applied treatment of how to do statistics, including an emphasis on the code. One thing I really liked about this survival analysis presentation was that she provided the code for each model and explained it statement by statement, followed by the results. She did have some discussion of the major concepts in survival analysis at the beginning, but it was a very applied approach overall, including how to set up your data. It’s not that AERA or JSM will never have this type of presentation, but the emphases on theory, equations and code vary. Nothing is wrong with any of that, but your choice of which to attend depends on your preference.

My personal bias is this: yes, problems can occur when you have a programmer who really does not understand even the most basic mathematics underlying the procedures, nor the theory. These people, for example, will not see a negative variance as a red flag or question a mean score of 100 on the Beck Depression Inventory. At least these people can DO SOMETHING and get output for me to look at.

A larger problem I have experienced is with very bright people who can tell me all about how to prove something Euclid proved back in a gazillion B.C. and provide citations for the ten most recent articles in academic journals on whatever the topic is, but who can’t actually DO shit. Those people should go to SAS Global Forum and just sit in sessions like this without moving until the lights go out.

See what I provide here, rambling on statistics AND free career advice. You’re welcome.



The first cool thing you should know about Dr. Patricia Berglund is that she and several others put their slides and more up at the SAS Global Forum take-out section. That is NOT, contrary to what you might believe, a place where you can pick up some really good Chinese food to eat while listening to the sessions. Whether or not that would be a good idea is a different issue.

No, au contraire, this is a section where people out of the goodness of their hearts put their slides up, and more, so if you could not attend the conference, or you want to review the material, well, there you have it. For example, Dr. Berglund’s presentation on survival analysis is right here.

One might wonder whether this would cut into conference attendance. I don’t think so; quite the opposite, in my opinion. If I see several really good presentations, I’m more likely to want to attend next year. The takeout section includes just a sample of what you could have seen, and a pretty representative sample at that. The statistics section is usually very good at this conference, but this year was even better than usual. So, if you see a dozen presentations and know there are 300 more where those came from, you might want to come. If you can’t come because your evil pointy-haired boss won’t approve your travel budget, then you can at least get some benefit from the takeout.

And if you did attend and want a review of a section you really liked, well, there it is.

So, assuming you did not follow that link and listen to the 43-minute video, here are a couple of points:

1. In survival analysis, time to event of interest is a major focus. This is unlike other methods like logistic regression where you are only interested in whether or not the event occurred.

2. The survival distribution function is the probability of surviving beyond a given time.

3. The cumulative distribution function is the probability of the event occurring at or before a given time, for example, the probability of death. Obviously, these two functions are going to be mirror images of each other.

4. The hazard function is the probability of the event occurring at a given time, given that it hasn’t already occurred.

She did not give this example, but I was reminded of Sudden Infant Death Syndrome, which, if it is going to occur, generally happens under one year of age. I think I held my breath for the first twelve months of life for each of my children. You can imagine the hazard function being different for all sorts of things that are age dependent – for example, miscarriages tend to occur early in pregnancy and the older the fetus the less hazard of miscarriage. The whole hazard function concept is interesting to me but since this was not a one-day seminar, she moved on to …
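She did not put it this way, but the relationships among points 2 through 4 are standard and easy to state, with S(t) the survival function, F(t) the cumulative distribution function, f(t) the density and h(t) the hazard:

```latex
\[
  S(t) = 1 - F(t)
  \qquad
  h(t) = \frac{f(t)}{S(t)}
\]
```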


Now here is the kind of off-label use of SAS procedures that I love. PROC LIFETEST does not currently allow accounting for complex samples. It does, however, provide nice descriptive survival curves, which you might like.

You can’t use a WEIGHT statement with PROC LIFETEST. You can, however, use a FREQ statement. Frequencies can only be integers, but weights are often NOT integers. So, what do you do?

The solution is obvious, really. Multiply your weights by 100 or 1,000, whatever you need to make them integers (rounding off any remaining decimals). What you are interested in obtaining at this stage is the curve, and multiplying every weight by the same constant is not going to change the shape of the curve one bit.

You can also use a STRATA statement in PROC LIFETEST, which she did, and produced curves by ethnicity.
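Putting the FREQ trick and the STRATA statement together, a minimal sketch (the dataset and variable names here are hypothetical stand-ins, not from her presentation):

```sas
data weighted ;
  set ncs ;
  /* FREQ requires integers, so scale the sampling weight and round */
  intwt = round(finalwt * 1000) ;
run ;

proc lifetest data = weighted plots = (s) ;
  time ageonset * depressed(0) ; /* 0 = censored, never depressed */
  freq intwt ;
  strata ethnic ;                /* one survival curve per ethnic group */
run ;
```

Since only the shape of the curves matters here, the scaled weights are fine; just don’t trust the standard errors.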

Of course, if you don’t account for your sampling method, the standard errors are going to be incorrect, but since the point of this part of the analysis is simply descriptive, it doesn’t matter.

The world’s most spoiled thirteen-year-old has a friend whose parents are divorced and the mother now has a boyfriend. She commented to me,

“I think that is disgusting. People’s parents should not have sex in the house. They should have it somewhere else. Like Mexico.”

Since she is asleep, and I have been gone several days, even though there was a lot more really cool stuff that came up in this session about survival analysis, it will have to wait until tomorrow.
