Hopefully, you have read my Beginner’s Guide to Propensity Score matching or through some other means become aware of what the hell propensity score matching is. Okay, fine, how do you get those propensity scores?

Think about this carefully for a moment, if you are using quintiles, you are matching people by which group they fit into as far as probability of being in the treatment group. So, if your friend, Bob, has a predicted probability of 15% of being in the treatment group, his quintile would be 1, because he is in the lowest 20%, that is, the bottom fifth, or quintile. If your other friend, Luella, has a predicted probability of being in the treatment group of 57%, then she is in the third quintile.

Oh, if only there were a means of getting the predicted probability of being in a certain category – oh, wait, there is!

Let’s do binary logistic regression with SAS Studio

First, log into your SAS Studio account.

Second, you probably need to run a program with a LIBNAME statement to make your data available. I am going to skip that step because in this example I’m going to use one of the SASHELP data sets and create a data set in mu WORK library as so, so I don’t need a LIBNAME for that but, as you will see, I do need it later. Here is the program I ran.

data psm_ex ;
set sashelp.heart ;
if smoking = 0 then smoker = 0 ;
else if smoking > 0 then smoker = 1;
WHERE weight_status ne “Underweight” ;

libname mydata “/courses/blahblah/c_123/” ;


My question is if I had people who had the same propensity to smoke, based on age, gender, etc. would smoking still be a factor in the outcome (in this case, death). To answer that, I need propensity scores.

Third, in the window on the left, click on TASKS AND UTILITIES, then STATISTICS and select BINARY LOGISTIC REGRESSION, as shown below.


Next,  choose the data set you want by clicking on the thing under the word DATA that looks like a table of data and selecting the library and data set in that library. Next, under RESPONSE, click the + sign and select the dependent variable for which you want to predict the probability. In this case, it’s whether the person is a smoker or not. Click the arrow next to EVENT OF INTEREST and pick which you want to predict, in this case, your choices are 0 or 1. I selected 1 because I want to predict if the person is  a smoker.

Below that, select your classification variable,

choosing data


There is also a choice for continuous variables (not shown) on the same screen.  I selected AGEATSTART.

I’m going to select the defaults for everything but OUTPUT. Click the arrow at the top of the screen next to MODEL and keep clicking until you see the OUTPUT tab. Click on the box next to CREATE OUTPUT DATASET. Browse for a directory where you want to save it.  I had set that directory in my LIBNAME statement (remember the LIBNAME statement) so it would be available to save the data. Select that directory and give the data set a name.

Click the arrow next to PREDICTED VALUES and in the 3 boxes that appear below it, click the box next to predicted values.

create output data set


After this, you are ready to run your analysis. Click the image of the little running guy above.  When your analysis runs you will have a data set with all of your original data plus your predicted scores.



Now, we just need to compute quintiles.You could find the quintiles by doing doing this:


tables pred_ ;

and look for the 20th, 40th, etc. percentile

However, an easier way if you have thousands of records is

proc univariate data=mydata.statspsm ;
var pred_ ;
output pctlpre=P_ pctlpts= 20 to 80 by 20;
proc print data=data1 ;

Which will give you the percentiles.

Support my day job AND get smarter. Buy Fish Lake for Mac or Windows. Brush up on math skills and canoe the rapids.

girl in canoe

For random advice from me and my lovely children, subscribe to our youtube channel 7GenGames TV

I have to choose between either SAS or SPSS for a new course in multivariate statistics. You can take it up with the university if you like, but  these are my only two options, in part because the course is starting soon.

I need to decide in a few days which way to go. Here are my very idiosyncratic reasons for one versus the other:

  • SPSS
  • There is a really good textbook on multivariate statistics that I think would be perfect for these students and it uses SPSS. The book is Advanced and Multivariate Statistics by Mertler & Vannatta, in case you were wondering.
  • SPSS can be installed pretty easily on the desktop and these are pretty non-technical students, so that’s a plus.
  • The point and click interface for SPSS is pretty easy and similar to Excel which most people have used.
  • Personally, I haven’t used SPSS in a while so it would be nice to use something different.


  • Students can just register and go to the website to use SAS Studio
  • Structural equation modeling and other advanced statistics procedures built in and not on add-on
  • SAS Studio is free vs $80 or so for students and $260 for professor (i.e., me) to buy SPSS academic versions including add-ons needed
  • I’m more familiar with SAS and find it easier to code than SPSS syntax.

I’ve toyed with the idea of showing both options but that uses up class time better spent on teaching, for example, how do you interpret a factor loading or AIC.

My big objection to SAS is I can’t find a recent textbook that is good for a multivariate analysis course that is in a social sciences department. The best one is by Cody and that is from 2005. I also use a couple of chapters from the Hosmer & Lemeshow book on Applied Logistic Regression , but I need something that covers factor analysis, repeated measures ANOVA and hopefully, MANOVA and discriminant function analysis, too.

I think most of these students have careers in non-profits and they are not going to be creating new APIs to analyze tweets or anything using enormous databases, so the ability to analyze terabytes is moot. This will probably be their second course in statistics and maybe their first introduction to statistical software.

Suggestions are more than welcome.

Support my day job AND get smarter. Buy Fish Lake for Mac or Windows. Brush up on math skills and canoe the rapids.

girl in canoe

For random advice from me and my lovely children, subscribe to our youtube channel 7GenGames TV

P. S. You can skip the hateful comments on why SAS and SPSS both suck and I should be using R, Python or whatever your favorite thing is. Universities don’t usually give carte blanche. These are my two choices.

P.P.S. You can also skip the snarky comments on how doctoral students should have a lot more statistics courses, all take at least a year of Calculus, etc. Even if I might agree with you, they don’t and I need tools that work for the students in my classes, not some hypothetical ideal student.

It ought to be easier than this and perhaps I could have found an easier way if I had more patience than the average ant or very young infant. However, I don’t.
Here was the problem. I wanted control charts for two different variables, satisfaction with care, surveyed at discharge, and satisfaction with care 3 months after discharge.
The data was given in the form of the number of patients out of a sample of 500 who reported being unsatisfied. PROC SHEWHART does not have a WEIGHT statement. You could try using the WEIGHT statement in PROC MEANS but that won’t work. It will give you the correct means if you have the number unsatisfied (undisc = 1)  and the number satisfied (undisc =0) out of 500, but the incorrect standard deviation because the N will be 2, according to SAS.
So, here is what I did and it was not elegant but it did work.
  1. I created two data sets, named q4disc and q4disc3, keeping the month of discharge and the number dissatisfied at discharge and dissatisfied 3 months later, respectively.
  2. I read in the 3 values I was given, month of sample, number unsatisfied at discharge and number unsatisfied 3 months later.
  3. Now, I am going to create a data set of raw data based on the numbers I have. First, in a do loop, for as many as people said they were unsatisfied, I set the value of undisc (unsatisfied at discharge) to to 1 and output a record to the q4disc dataset.
  4. Next, in a do loop for 500- the number dissatisfied, I set undisc = 0 and output a record to the same dataset.
  5. Now, repeat steps 3 & 4 to create a data set of the values of people unhappy 3 months after discharge.
  6. Following the programming statements are the original data.

So, now, I have created two data sets of 6,000 records each with three variables. Doesn’t seem that efficient of a way to do it but now I have the data I need and it didn’t take long and doesn’t take up much space.

data q4disc (keep = undisc month) q4disc3 (keep = undisc3 month) ;
input month $ discunwt disc3unwt ;
Do I = 1 to discunwt ;
    undisc = 1 ;
    output q4disc ;
end ;
Do J = 1 to (500-discunwt) ;
   undisc = 0 ;
   output q4disc;
end ;
Do k = 1 to disc3unwt ;
   undisc3 = 0 ;
   output q4disc3 ;
End ;
Do x = 1 to (500 -disc3unwt) ;
  undisc3 = 1 ;
   output q4disc3;
datalines ;
JAN 24 17
FEB 44 24
MAR 36 15
APR 18 8
MAY 16 11
JUN 19 7
JUL 17 11
AUG 18 9
SEP 27 10
OCT 26 15
NOV 29 12
DEC 26 11
proc shewhart data=WORK.Q4disc;
xschart undisc * month /;
According to SAS

“The XSCHART statement creates and charts for subgroup means and standard deviations, which are used to analyze the central tendency and variability of a process.”

For the three months after discharge variable, just do another PROC SHEWHART with q4disc3 as the dataset and undisc3 as the measurement variable.

OR , once you have the dataset created, you can get the chart using SAS Studio by selecting the CONTROL CHARTS task

Control charts window with month as subgroup and undisc as measure

Either way will give you this result:

Control chart

Support my day job AND get smarter. Buy Fish Lake for Mac or Windows. Brush up on math skills and canoe the rapids.

girl in canoe

For random advice from me and my lovely children, subscribe to our youtube channel 7GenGames TV

I wouldn’t normally consider Excel for analysis, but there are four reasons I’ll be using it sometimes for the next class I’m teaching. First of all, we start out with some pretty basic statistics, I’m not even sure I’d call them statistics, and Excel is good for that kind of stuff. Second, Excel now has data analysis tools available for the Mac – years ago, that was not the case. Since my students may have Mac or Windows, I need something that works on both.  Third, many of the assignments in the course I will be teaching use small data sets – and this is real life. If you are at a clinic, you don’t have 300,000,000 records.Four, the number of functions and ease of use of functions in Excel has increased over the years.

For example,


Select all of the data you want and select COPY

Click on the cell where you want the data copied and select PASTE SPECIAL from the edit menu. Click the bottom right button next to TRANSPOSE and click OK. Voila. Data transposed.


Once you have your data in columns (and if it isn’t, see TRANSPOSE above), you just need to

Excel add-ons window

  1. Add the Analysis Pack. You only need to do this once and it should be available with Excel forever more.  To do that, go to TOOLS and select EXCEL ADD-INS. Then click the box next to Analysis ToolPak and click OK.
  2.  Now, go to TOOLS, select DATA ANALYSIS and then pick REGRESSION ANALYSIS

Regression analysis menuYou just need to select the range for the Y variables, probably one column, select the range for the X variables, probably a column adjacent to it, and click OK. You may also select confidence limits, fit plots, residuals and more.

So, yeah, for simple analyses, Excel can be super-simple.

Believe it or not, this is what I do for fun. In my day job, I make video games that teach math and social studies.

You can check out the games we make here.


Working on some fun things  using  SAS Studio, so, expect a number of short posts over the next few days. Last time, I talked about the utilities and how easy it is to import an Excel file. Now let’s say maybe you are not aUnix person  and you have no idea how exactly to code a LIBNAME statement  that is not on Windows.  Never fear, it’s super easy.

Right click on the folder where you want to save your data set. From  the  menu that appears, select the last choice which is ‘properties’.

A window will come up that shows the name of your folder and its location, it’s easy to spot because it’s right next to the word Location. It will look something like this:


to save your data you have uploaded an Excel file and imported into  SAS, remember that the files were saved in the work directory and named import, import 1 etc.If I wanted to  sort those data sets and then merge them together into a permanent data set, I’d do it in the exact same way as if I was using Windows. The only thing different is the LIBNAME statement, as you can see below.

LIBNAME in “/home/your_name/data_analysis_examples”;

Proc sort data = work.import;
By username;

Proc sort data = work.import1;
By username;
data in.crossroads ;
merge work.import work.import1;
By username;

If, later on, I want to use that data set in a program, again I would do it exactly the same as in Windows and the only thing different would be my LIBNAME.


LIBNAME in “/home/your_name/data_analysis_examples”;


Proc means data = in.Crossroads;


Completely random fact, unrelated to SAS studio, or maybe it is related,  I hurt my arm again, so I have been writing my SAS programs using Dragon voice recognition software.  If you are going to use SAS studio on a Mac, you should be aware that Dragon does not work on Firefox on the Mac so open up Chrome if you want to use voice recognition software, or at least the software from Dragon. This has nothing to do with SAS specifically.

 Support My Day Job!  Buy Games That Teach Social Studies And Math And Have Fun! (You can even get Making Camp for free! 

Buy our games


It’s been about a year since I last looked at SAS Studio much  –


In my previous life, I taught for years at a small liberal arts college, with under 2,000 students. I also taught at a tribal community college with less than 500 students. In neither of those situations did we have the funding to pay for expensive software. SAS Studio is FREE. I could have really used this when I was teaching at those small schools. Check it out.


So, it’s free, but I don’t teach that often because I have a day job as president of The Julia Group where clients want me to do some much stuff we quit taking new clients years ago and also president of 7 Generation Games where they want me to do more stuff.

The last class I taught, we used SAS on a remote desktop – which I liked a lot. So, yes, no SAS Studio for me for a while.

In case, like me, you are more a programming type and haven’t been too pointy-clicky, perhaps you missed the TASKS AND UTILITIES. Well, don’t.

Let’s say you want to import a file from Excel into SAS. First, upload it by clicking on the folder where you want it stored and then clicking the upload button at the top left of your screen.

Look to the bottom left of your screen and you will see this. Well, you’ll see the Tasks and Utilities anyway, the stuff above it is files for class examples.

Tasks and utilities menu

Click on the arrow next to Tasks and Utilities and you’ll find all kinds of cool stuff.  Click the arrow next to utilities and pick IMPORT DATA

upload window

Drag the file you uploaded into the window on the right and, voila!

There you go, your Excel file is imported into SAS. You can see the code in the CODE window. DON’T FORGET TO CLICK THE LITTLE RUNNING GUY AT THE TOP OF YOUR SCREEN TO RUN THIS.

Note that the file is named WORK.IMPORT because you’ll need that name for the next task, but that’s next time because I have to go back to work.

/* Generated Code (IMPORT) */
/* Source File: testit.xlsx */
/* Source Path: /home/annmaria.demars/homework */
/* Code generated on: 2/6/17, 11:27 PM */


FILENAME REFFILE ‘/home/annmaria.demars/homework/testit.xlsx’;





Screen Shot 2017-02-06 at 11.36.53 PM

SAS nicely runs the PROC CONTENTS, too, so you end up with a table telling you the contents of your new data set.

Once you have your data imported, you can use the TASKS menu to complete (what else) statistical tasks. I wrote about those in some other posts below:

My point is, there is a lot of stuff under that little tab and you should check it out. Also, if you are a small school, SAS Studio is an awesome resource you can get for free and I bet you could use it.

Support my day job! Learn about Ojibwe history and culture. Practice multiplication and division

FREE GAME FOR iPad or Android tablets

wig wam in snow

Click over here to find links to Making Camp in the App Store or on Google Play. Yes, it’s free.

So this is attempt number two with voice recognition software. Now that I have my new custom splint on and I look something like Darth Vader with the robot arm I thought I had better not just keep doing the same thing that caused this problem in the first place.

my darth vader arm

The arthritis in my hands has just been getting worse to the point where I just had my left thumb, reconstructed. I know from other sports injuries that what happens when you injure one part is that other body parts get stressed and start to get injured. For example if you injure your right knee you start putting so much weight on your left knee to compensate that your left knee soon is giving you problems as well.

The Dragon software that I have only works on Windows although the Mac version is coming out very soon. So far it seems to work better than read and write the Google Chrome extension I have used.

What I like about this software so far is that it can do more than just type. It will open a web browser you can correct and underline words and do other formatting.

It’s going to be kind of weird to get use to dictating instead of typing. I’m sure it’s going to take me a while after all I’ve been typing for probably 40 years. I’m certain though that this will help the problems I have with my hands a lot. I’m not sure I’ll be able to do a lot of coding with this, though who knows.

I don’t think it will really work on planes and airports where I spend an inordinate amount of my time. Maybe it will though, I have a friend who is visually impaired and she talks into her phone all the time giving it messages and commands so I’m sure it’s just a matter of getting used to it.

Well I currently have about 900 unanswered e-mail messages, I also have an IRB application to complete and loads of documentation to write. I expect just like learning to use a word processor for the first time this will be a bit of a time-consuming learning process but well worth it in the end.

You’d think that talking to your computer would feel more natural and it would be easier to write but I can’t say that’s the case at all. Obviously I’m much more used to typing.

We’ll see as time passes if this gets easier. I presume it does.

Do you use voice recognition software to type? If so, how long was it before you felt comfortable doing it?




Fish lake woman

you can make anything into an opportunity.

For example today I had this very unpleasant operation on my phone actually that was my thumb not my phone. as you may have guessed comma I am now writing using a piece of voice recognition software.

me wearing cast inside giant foam pillow on arm

It’s a Google Chrome extension. this makes me happy for two reasons. the pain pills are not one of them.

the first reason is that I have been wanting to experiment more with Google Chrome extensions.  

At some point we are planning on using Chrome extensions 4  for our game making camp. this is a great opportunity for me to start learning more about how extensions work.

The second reason this is a great opportunity is that I have wondered for some time what I’m going to do when I get old.

I’m just not sitting around knitting type of person. my hand has been bothering me for quite some time. it’s only a matter of time until my other hand starts to bother me as well. So I’ve been wondering about this comma what could I do if I didn’t work.

Now all kinds of people including all of my relatives most of my friends tell me all of the time that I should not work so much. I mention that I did not ask any of these people their opinion? You see the issue isn’t that I can’t think of things to do instead of work. the point is that I like to work and the thought that I couldn’t do it anymore is a bit depressing.


There are a few drawbacks of  read and write for Google Chrome which you may have already detected. One is that it has a rather random view of capitalization.  I’m sure that if you read this post closely you can identify other drawbacks. for example like Siri it often misinterpret your words. I left most of the errors here so that you could see. I did fix a few where the sentence made absolutely no sense.

I found it works better if you speak more slowly.


So far  it hasn’t been too bad. it was super easy to install and I figured out how to use the speech to text by watching 2 minute YouTube video.


On the other hand haha  that’s a joke since I only have one hand –  it seems like the only way to get the premium features is to be at a school that licenses those at the school or maybe classroom level.  right now I’m using the 30 day trial version.


The other problem I have found is that sometimes the microphone just randomly quits working.  toggle it off and on to fix Problem.


2 move 2A new line All you need to do is say those words which ironically since I wanted actually those words in the sentence I had to take them otherwise it would have gone to a well you know.


now if you read this you can see it kind of makes me look like a cross between a teenager using text-speak and someone with a very poor grasp of grammar and spelling. however I think that much of that could be improved with practice and getting 2 no the software better. we’ll see if with practice the voice recognition can be accurate at a faster speed because this slow pace is pretty annoying. the invisible developer just told me that I  sound like a bit from Find old radio show called the slow talkers of America.

New line I also think it would be really really difficult to write code using this with all of the special characters required like  square brackets and curly brackets parentheses etcetera etcetera.


After a few weeks tough trying this out I’m going to check out dragon I have a friend who is visually impaired who uses that so I’m going to ask her 2 show me because I’m sure she knows all of the special features as I believe she even used it to write her thesis.. You’re line

If you have any other suggestions either 4 Chrome extensions in general or on using Speech-to-Text software please post it and the comments.



Fish lake woman

A picture says 1,000 words – especially if you are talking to a non-technical audience. Take the example below.

We wanted to know whether the students who played our game Fish Lake at least through the first math problem and the students who gave up at the first sight of math differed in achievement. Maybe the kids who played the games were the higher achieving students and that would explain why they did better on the post-test.

You can see from the chart below this is not the case. The distribution of pretest scores is pretty similar for the kids who quit playing (the top) and those who persisted.

Graphs produced by ODSBeneath the graphs, you can see the box and whisker plots. The persistent group has fewer students at the very low end and we actually know why that is – students with special needs in the fourth- and fifth-grade, for example, those who were non-readers, could not really play the game and either quit on their own very soon or were given alternative assignments by the teacher.

The median (the line inside the box), the mean (the diamond) and 25th percentile (the bottom of the box) are all slightly higher for the persisting group – for the same reason, the students with the lowest scores quit right away.

These data tell us  that the group that continued playing and the group that quit were pretty similar except for not having the very lowest achieving students.

So, if academic achievement wasn’t a big factor in determining which students continued playing the games, what was?

That’s another chart for another day, but first, try to guess what it was.


Would you like to play one of our games? Check them out here – all games run on Mac and Windows.


What about Chromebooks?  Check out Forgotten Trail.

characters traveling on map

If I were to give one piece of advice to a would-be program evaluator, it would be to get to know your data so intimately it’s almost immoral.

Generally, program evaluation is an activity undertaken by someone with a degree of expertise in research methods and statistics (hopefully!) using data gathered and entered by people’s whose interest is something completely different, from providing mental health services to educating students.

Because their interest in providing data is minimal, your interest in checking that data better be maximal. Let’s head on with the data from the last post. We have now created two data sets that have the same variable formats so we are good to go with concatenating them.
DATA answers hmph;
SET fl_answers ansfix1 ;
IF username IN(“UNDEFINED”,”UNKNOWN”) or INDEX(username,”TEST”) > 0 THEN OUTPUT hmph;
ELSE OUTPUT answers;

PRO TIP : I learned from a wise man years ago that one should not just gleefully delete data without looking at it. That is, instead of having a dataset where you put the data you expect and deleting the rest, send the unwanted data to a data set. If it turns out to be what you expected, you can always delete the data after you look at it.

There should be very few people with a username of  ‘UNDEFINED’ or ‘UNKNOWN’. The only way to get that is to be one of our developers who are entering the data in forms as they create and test them, not by logging in and playing the game.   The INDEX function checks in the variable in the first argument for the string given in the second and returns the starting position of the string, if found. So,  INDEX(username, “TEST”) > 0 looks for the word TEST anywhere in the username.

Since we ask our software testers to put that word in the username they pick, it should delete all of the tester records. I looked at the hmph data set and the distribution of usernames was just as I expected and most of the usernames were in the answers data set with valid usernames.

Did you remember that we had concatenated the data set from the old server and the new server?

I hope you did because if you didn’t you will end up with a whole lot of the same answers in their twice.

Getting rid of the duplicates

PROC SORT DATA = answers OUT=in.all_fl_answers NODUP ;
by username date_entered ;

The difference between NODUP and NODUPKEY is relevant here. It is possible we could have a student with the same username and date_entered because different schools could have assigned students the same username. (We do our lookups by username + school). Some other student with the same username might have been entering data at the same time in a completely different part of the country. The NODUP option only removes records if every value of every variable is the same. The NODUPKEY removes them if the variables in the BY statement are duplicates.

All righty then, we have the cleaned up answers data, now we go back and create a summary data set as explained in this post. You don’t have to do it with SAS Enterprise Guide as I did there, I just did it for the same reason I do most things, the hell of it.


PROC SORT DATA = in.answers_summary ;
BY username ;

PROC SORT DATA = in.all_fl_students ;
BY username ;

DATA in.answers_studunc odd;
MERGE in.answers_summary (IN=a) in.all_fl_students (IN=b) ;
IF a AND b THEN OUTPUT in.answers_studunc  ;

The PROC SORT steps sort. The MERGE statement merges. The IN= option creates a temporary variable with the name ‘a’ or ‘b’. You can use any name so I use short ones.  If there is a record in both the student record file and the answers summary file then the data is output to a data set of all students with summary of answers.

There should not be any cases where there are answers but no record in the student file. If you recall, that is what set me off on finding that some were still being written to the old server.


There is a sad corner of statistical purgatory for people who don’t look at their log files because they don’t know what they are looking for. ‘Nuff said.

This looks exactly as it should. A consistent finding in the pilot studies of assessment of educational games has found a disconcertingly low level of persistence. So, it is expected that many players quit when they come to the first math questions.  The fact that of the 875 players slightly less than 600 had answered any questions was somewhat expected. As expected, there were no records where

NOTE: There were 596 observations read from the data set IN.ANSWERS_SUMMARY.
NOTE: There were 875 observations read from the data set IN.ALL_FL_STUDENTS.
NOTE: The data set IN.ANSWERS_STUDUNC has 596 observations and 11 variables.
NOTE: The data set WORK.ODD has 0 observations and 11 variables.

So, now, after several blog posts, we have a data set ready for analysis ….. almost.

Want to see these data at the source?

Check out our game, playable on Mac or Windows. Download Spirit Lake or Fish Lake  to play, or for Forgotten Trail, just click on the link provided, no download required.

Mom and kid

You can also donate a copy of the game to a school or give as a gift.

Further Reading

For more on SAS character functions check out Ron Cody’s paper An Introduction to Character Functions, an oldie but goodie from WUSS back in 2003.

Or you could read my last post!

This paper by Britta Kelsey from SAS Users Group International in 2005 will tell you more than you want to know about the NODUP and NODUPKEY.

← Previous PageNext Page →