### May

#### 26

# Using Characterize Data Task to Inspect Data Quality

Filed Under Software, statistics, Technology | Leave a Comment

Since I had done a few YouTube videos on using SAS Studio, I thought I would add them to my blog. This one uses the Characterize Data task to take a quick look at the data, but I suppose you could have guessed that from the title.

For random advice from me and my lovely children, subscribe to our youtube channel 7GenGames TV

### Apr

#### 30

# Pointy, Clicky Propensity Score Matching With SAS

Filed Under Software, statistics | Leave a Comment

Hopefully, you have read my Beginner’s Guide to Propensity Score matching or through some other means become aware of what the hell propensity score matching is. Okay, fine, how do you get those propensity scores?

Think about this carefully for a moment. If you are using quintiles, you are matching people by which group they fit into as far as probability of being in the treatment group. So, if your friend, Bob, has a predicted probability of 15% of being in the treatment group, and that puts him among the lowest 20% of predicted probabilities in your sample, his quintile would be 1, that is, the bottom fifth. If your other friend, Luella, has a predicted probability of 57%, and that falls between the 40th and 60th percentiles of your sample, then she is in the third quintile. The quintile depends on where a score ranks among everyone else’s scores, not on the raw probability itself.

Oh, if only there were a means of getting the predicted probability of being in a certain category – oh, wait, there is!

Let’s do binary logistic regression with SAS Studio

First, log into your SAS Studio account.

Second, you probably need to run a program with a LIBNAME statement to make your data available. I am going to skip that step for now because in this example I’m going to use one of the SASHELP data sets and create a data set in my WORK library, so I don’t need a LIBNAME for that but, as you will see, I do need it later. Here is the program I ran.

libname mydata "/courses/blahblah/c_123/" ;

data psm_ex ;

set sashelp.heart ;

where weight_status ne "Underweight" ;

/* smoker stays missing when smoking is missing */

if smoking = 0 then smoker = 0 ;

else if smoking > 0 then smoker = 1 ;

run ;

My question is if I had people who had the same propensity to smoke, based on age, gender, etc. would smoking still be a factor in the outcome (in this case, death). To answer that, I need propensity scores.
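Looking ahead, here is roughly what that comparison looks like once everyone has a propensity quintile. This is a minimal sketch in Python rather than SAS, and the records are invented purely for illustration; `records` and `death_rates_by_stratum` are hypothetical names, not anything from SASHELP.HEART.

```python
from collections import defaultdict

# Hypothetical records: (propensity quintile, smoker flag, died flag).
# These values are invented purely to show the bookkeeping.
records = [
    (1, 0, 0), (1, 0, 1), (1, 1, 1), (1, 1, 0),
    (2, 0, 0), (2, 0, 0), (2, 1, 1), (2, 1, 0),
]

def death_rates_by_stratum(rows):
    """Return {quintile: {smoker: death rate}} for each quintile/smoker cell."""
    counts = defaultdict(lambda: [0, 0])  # (quintile, smoker) -> [deaths, n]
    for quintile, smoker, died in rows:
        cell = counts[(quintile, smoker)]
        cell[0] += died
        cell[1] += 1
    rates = defaultdict(dict)
    for (quintile, smoker), (deaths, n) in counts.items():
        rates[quintile][smoker] = deaths / n
    return {q: dict(r) for q, r in rates.items()}

rates = death_rates_by_stratum(records)
for quintile in sorted(rates):
    print(quintile, rates[quintile])
```

Within each quintile, smokers and non-smokers have a similar propensity to smoke, so a gap in death rates inside a quintile is the comparison of interest.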

Third, in the window on the left, click on TASKS AND UTILITIES, then STATISTICS and select BINARY LOGISTIC REGRESSION.

Next, choose the data set you want by clicking on the thing under the word DATA that looks like a table of data and selecting the library and data set in that library. Next, under RESPONSE, click the + sign and select the dependent variable for which you want to predict the probability. In this case, it’s whether the person is a smoker or not. Click the arrow next to EVENT OF INTEREST and pick which you want to predict, in this case, your choices are 0 or 1. I selected 1 because I want to predict if the person is a smoker.

Below that, select your classification variables.

There is also a choice for continuous variables (not shown) on the same screen. I selected AGEATSTART.

I’m going to select the defaults for everything but OUTPUT. Click the arrow at the top of the screen next to MODEL and keep clicking until you see the OUTPUT tab. Click on the box next to CREATE OUTPUT DATASET. Browse for a directory where you want to save it. I had set that directory in my LIBNAME statement (remember the LIBNAME statement) so it would be available to save the data. Select that directory and give the data set a name.

Click the arrow next to PREDICTED VALUES and in the 3 boxes that appear below it, click the box next to predicted values.

After this, you are ready to run your analysis. Click the image of the little running guy above. When your analysis runs you will have a data set with all of your original data plus your predicted scores.

Now, we just need to compute quintiles. You could find the quintiles by doing this:

PROC FREQ DATA=MYDATA.STATSPSM ;

tables pred_ ;

run ;

and look for the 20th, 40th, etc. percentile in the cumulative percent column.

However, an easier way if you have thousands of records is

proc univariate data=mydata.statspsm ;

var pred_ ;

output out=quintiles pctlpre=P_ pctlpts=20 to 80 by 20 ;

run ;

proc print data=quintiles ;

run ;

Which will give you the percentiles.
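Once you have the four cut points, each person still needs to be dropped into a quintile. The following is a rough sketch of that step in Python rather than SAS. The scores are made up, `quintile` is a hypothetical helper, and Python's percentile definition is not necessarily the one PROC UNIVARIATE uses by default, so cut points can differ slightly.

```python
import bisect
import statistics

# Hypothetical predicted probabilities (stand-ins for the pred_ column).
scores = [0.05, 0.12, 0.15, 0.22, 0.31, 0.38, 0.44, 0.57, 0.63, 0.90]

# Four cut points: the 20th, 40th, 60th and 80th percentiles.
cut_points = statistics.quantiles(scores, n=5, method="inclusive")

def quintile(score, cuts):
    """Map a score to 1..5 given the four ascending cut points."""
    return bisect.bisect_left(cuts, score) + 1

quintiles = [quintile(s, cut_points) for s in scores]
for score, q in zip(scores, quintiles):
    print(score, q)
```

Which quintile a given score lands in depends on the other scores in the data set, which is why the cut points have to be computed from the data rather than assumed.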


### Apr

#### 27

One advantage of writing this blog for almost a decade is that there are a lot of topics I have already covered. However, software moving at the speed that it does, there are always updates.

So, today I’m going to recycle a couple of older posts that introduce you to propensity score matching. Then, tomorrow, I will show you how to get your propensity scores with just pointing and clicking with a FREE (as in free beer) version of SAS.

## Before you even THINK about doing propensity score matching …

Propensity score matching has had a huge rise in popularity over the past few years. That isn’t a terrible thing, but in my not so humble opinion, many people are jumping on the bandwagon without thinking through if this is what they really need to do.

The idea is quite simple – you have two groups which are non-equivalent, say, people who attend a support group to quit being douchebags and people who don’t. At the end of the group term, you want to test for a decline in douchebaggery.

However, you believe that people who don’t attend the groups are likely different from those who do in the first place: bigger douchebags, younger, and, it goes without saying, more likely to be male.

The very, very important key phrase in that sentence is YOU BELIEVE.

Before you ever do a propensity score matching program you should test that belief and see if your groups really ARE different. If not, you can stop right now. You’d think doing a few ANOVAs, t-tests or cross-tabs in advance would be common sense. Let me tell you something, common sense suffers from false advertising. It’s not common at all.
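As one way to run that check, here is a small sketch of a Welch t statistic for a single covariate, written in Python for illustration. The ages are invented and `welch_t` is a hypothetical helper. It returns the statistic and approximate degrees of freedom only; you would still look up a p-value or, more likely, let your statistics package do the whole thing.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    # Squared standard error contributed by each group.
    se_a, se_b = var_a / len(group_a), var_b / len(group_b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(group_a) - 1) + se_b ** 2 / (len(group_b) - 1)
    )
    return t, df

# Hypothetical ages of attendees and non-attendees (invented numbers).
attended = [34, 41, 29, 38, 45, 36]
did_not_attend = [27, 31, 24, 35, 29, 26]
t_stat, df = welch_t(attended, did_not_attend)
print(round(t_stat, 2), round(df, 1))
```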

Even if there are differences between the groups, it may not matter unless it is related to your dependent variable, in this case, the Unreliable Measure of Douchebaggedness.

## What type of Propensity Score Matching is for you? A statistics fable

Once upon a time there were statisticians who thought the answer to everything was to be as precise, correct and “bleeding edge” as possible. If their analyses were precise to 12 decimal places instead of 5, of course they were better because, as everyone knows, 12 is more than 5 (and statisticians knew it better, being better at math than most people).

Occasionally, people came along who suggested that newer was not always better, that perhaps sentences with the word “bleeding” in them were not always reflective of best practices, as in,

“I stuck my hand in the piranha tank and now I am bleeding.”

Such people had their American Statistical Association membership cards torn up by a pack of wolves and were banished to the dungeon where they were forced to memorize regular expressions in Perl until their heads exploded. Either that, or they were eaten by piranhas.

Perhaps I am exaggerating a tad bit, but it is true that there has been an over-emphasis on whatever is the shiniest, new technique on the block. Before my time, factor analysis was the answer to everything. I remember when Structural Equation Modeling was the answer to everything (yes, I am old). After that, Item Response Theory (IRT) was the answer to everything. Multiple imputation and mixed models both had their brief flings at being the answer to everything. Now it is propensity scores.

A study by Sturmer et al. (2006) is just one example of a few recent analyses that have shown an almost exponential growth in the popularity of propensity score matching, from a handful of studies in the late nineties to everybody and their brother.

You can read the rest of the post about choosing a method of propensity score matching here. If your clicking finger is tired, the take away message is this — quintiles, which are much simpler, faster to compute and easier to explain, are generally just as effective as more complex methods.

Now that we are all excited about quintiles, the next couple of posts will show you how to compute those in a mostly pointy-clicky manner.


### Apr

#### 26

# SAS vs SPSS for Teaching Multivariate Analysis in Social Sciences

Filed Under Software, statistics | 6 Comments

I have to choose between either SAS or SPSS for a new course in multivariate statistics. You can take it up with the university if you like, but these are my only two options, in part because the course is starting soon.

I need to decide in a few days which way to go. Here are my very idiosyncratic reasons for one versus the other:

SPSS

- There is a really good textbook on multivariate statistics that I think would be perfect for these students and it uses SPSS. The book is Advanced and Multivariate Statistics by Mertler & Vannatta, in case you were wondering.
- SPSS can be installed pretty easily on the desktop and these are pretty non-technical students, so that’s a plus.
- The point and click interface for SPSS is pretty easy and similar to Excel which most people have used.
- Personally, I haven’t used SPSS in a while so it would be nice to use something different.

SAS

- Students can just register and go to the website to use SAS Studio
- Structural equation modeling and other advanced statistics procedures are built in, not an add-on
- SAS Studio is free vs $80 or so for students and $260 for professor (i.e., me) to buy SPSS academic versions including add-ons needed
- I’m more familiar with SAS and find it easier to code than SPSS syntax.

I’ve toyed with the idea of showing both options but that uses up class time better spent on teaching, for example, how do you interpret a factor loading or AIC.

My big objection to SAS is I can’t find a recent textbook that is good for a multivariate analysis course that is in a social sciences department. The best one is by Cody and that is from 2005. I also use a couple of chapters from the Hosmer & Lemeshow book on Applied Logistic Regression, but I need something that covers factor analysis, repeated measures ANOVA and, hopefully, MANOVA and discriminant function analysis, too.

I think most of these students have careers in non-profits and they are not going to be creating new APIs to analyze tweets or anything using enormous databases, so the ability to analyze terabytes is moot. This will probably be their second course in statistics and maybe their first introduction to statistical software.

Suggestions are more than welcome.


P. S. You can skip the hateful comments on why SAS and SPSS both suck and I should be using R, Python or whatever your favorite thing is. Universities don’t usually give carte blanche. These are my two choices.

P.P.S. You can also skip the snarky comments on how doctoral students should have a lot more statistics courses, all take at least a year of Calculus, etc. Even if I might agree with you, they don’t and I need tools that work for the students in my classes, not some hypothetical ideal student.

### Mar

#### 25

# How to compute a standard deviation and control chart when you don’t have raw data

Filed Under Software, statistics, Technology | 1 Comment

1. I created two data sets, named q4disc and q4disc3, keeping the month of discharge and the number dissatisfied at discharge and dissatisfied 3 months later, respectively.
2. I read in the 3 values I was given: month of sample, number unsatisfied at discharge and number unsatisfied 3 months later.
3. Now, I am going to create a data set of raw data based on the numbers I have. First, in a do loop, for as many people as said they were unsatisfied, I set the value of undisc (unsatisfied at discharge) to 1 and output a record to the q4disc dataset.
4. Next, in a do loop for 500 minus the number dissatisfied, I set undisc = 0 and output a record to the same dataset.
5. Now, repeat steps 3 & 4 to create a data set of the values of people unhappy 3 months after discharge.
6. Following the programming statements are the original data.

So, now, I have created two data sets of 6,000 records each with three variables. Doesn’t seem that efficient of a way to do it but now I have the data I need and it didn’t take long and doesn’t take up much space.
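The same trick can be sketched in a few lines of Python, shown here only as an illustration of the logic and not as the SAS program itself. The monthly counts are invented, and `expand` and `sd_from_count` are hypothetical helper names. The second function shows that, for 0/1 data, you can get the identical standard deviation straight from the count without building the records at all.

```python
import math
import statistics

SAMPLE_SIZE = 500  # patients sampled each month, as in the steps above

def expand(n_unsatisfied, n_total=SAMPLE_SIZE):
    """Rebuild 0/1 raw records from a count of unsatisfied patients."""
    return [1] * n_unsatisfied + [0] * (n_total - n_unsatisfied)

def sd_from_count(n_unsatisfied, n_total=SAMPLE_SIZE):
    """Sample standard deviation of 0/1 data computed directly from the count."""
    p = n_unsatisfied / n_total
    return math.sqrt(n_total * p * (1 - p) / (n_total - 1))

# Hypothetical monthly counts (invented for illustration).
monthly_unsatisfied = {1: 40, 2: 55, 3: 35}

for month, count in monthly_unsatisfied.items():
    raw = expand(count)
    print(month, statistics.mean(raw), statistics.stdev(raw), sd_from_count(count))
```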

“The XSCHART statement creates X̄ and s charts for subgroup means and standard deviations, which are used to analyze the central tendency and variability of a process.”

For the three months after discharge variable, just do another PROC SHEWHART with q4disc3 as the dataset and undisc3 as the measurement variable.
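If you are curious what goes into the chart of subgroup means, here is a rough Python sketch of the textbook 3-sigma limits, assuming equal subgroup sizes. The counts are invented, and `c4` and `xbar_limits` are hypothetical helper names. PROC SHEWHART has its own options for estimating sigma, so treat this as an illustration of the idea and not a reproduction of its output.

```python
import math
import statistics

def c4(n):
    """Bias-correction constant for the sample standard deviation."""
    # lgamma avoids overflow for large subgroups such as n = 500.
    return math.sqrt(2 / (n - 1)) * math.exp(
        math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    )

def xbar_limits(subgroups):
    """Lower limit, center line and upper limit for subgroup means."""
    n = len(subgroups[0])
    grand_mean = statistics.mean([statistics.mean(g) for g in subgroups])
    s_bar = statistics.mean([statistics.stdev(g) for g in subgroups])
    half_width = 3 * s_bar / (c4(n) * math.sqrt(n))
    return grand_mean - half_width, grand_mean, grand_mean + half_width

# Hypothetical 0/1 subgroups of 500 (1 = unsatisfied); counts are invented.
subgroups = [[1] * k + [0] * (500 - k) for k in (40, 55, 35, 48)]
lower, center, upper = xbar_limits(subgroups)
print(lower, center, upper)
```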

OR, once you have the dataset created, you can get the chart using SAS Studio by selecting the CONTROL CHARTS task.

Either way will give you this result:


### Mar

#### 12

Perhaps you have watched the Socrata videos on how to do data visualization with government data sets and it is still not working for you. Here is a step by step example of answering a simple question.

### Is the prevalence of alcohol use among youth higher in rural states than urban ones?


First, I went to this Chronic Disease and Health Promotion Data & Indicators site.

Second, I selected Chronic Disease Indicators for my health area of interest.

Third, I selected ALCOHOL – which brought me to the screen showing all the columns of data and a bunch of choices.

Fourth, I clicked on FILTER on the right of the screen and then selected a column to filter by.

Chrome did not give me a scroll bar so the furthest option I could get was Topic. I switched over to Firefox and was able to get this menu where I selected Question and Alcohol use among youth. You have to type in the value that you want. Make sure it is spelled exactly the same as in the data set.

Fifth, since I wanted to compare urban and rural states, I clicked Add a New Filter Condition and then selected California, New York, North Dakota and Wyoming with LocationDesc as the filter condition. Make sure the box next to each condition you want is clicked on.

Sixth, I looked at my data, saw there was no data for California and I was sad. Not every state participates in every data set.

Seventh, I decided to compare urban, eastern states with rural midwest/west states, so I selected New York, New Jersey, Massachusetts, Wyoming, North Dakota and Montana. All had data so I was good to go.

In case you were wondering, I based my choice on the listing of states by urbanization: New Jersey is #2, Massachusetts #5 and New York #13.

On the other extreme, Wyoming is #39, North Dakota is #42, and Montana #47, so I thought this was a pretty good split.

Eighth, I clicked on Visualize on the right, selected Column as the type of chart, LocationDesc as the label data, DataValueAlt as the data value, and there was my chart.

Note: I could not select DataValue. My guess is that it was a string variable. I had to select DataValueAlt, which held the exact same value.

Ninth, just to make it more obvious, I sorted on data value, which caused the chart to be recreated automatically.

You can see below the chart it created. It’s pretty clear that in these data there is no relationship between urbanization and alcohol use among youth.

New York and New Jersey had the lowest and highest prevalence, respectively. I was hoping to see a pattern with more rural states higher, but it seemed to be pretty unrelated.

HOW TO DOWNLOAD THE DATA SETS FOR ANALYSIS

Perhaps you would prefer to download the data set for import into some other tool, say, Excel or SAS. The first three steps are the same, until you find the data set you want.

This next step is not required, but the data sets can be pretty big, so I’d suggest filtering on at least one major variable first. For example, you can click the three rows next to any column, say, Question, and then select the question that interests you, say, Binge Drinking.

Next, click the EXPORT button at the top right of the screen. Select the format in which you want your file to be downloaded. That’s it!
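If you export to CSV, the same filter-and-sort can be done after the download. Here is a minimal Python sketch, assuming the export keeps the column names mentioned above (LocationDesc, Question, DataValueAlt). The rows and values are invented stand-ins, not real data from the site.

```python
import csv
import io

# Stand-in for the exported CSV file; the rows and values are invented.
exported = io.StringIO(
    "LocationDesc,Question,DataValueAlt\n"
    "Wyoming,Alcohol use among youth,20.0\n"
    "New York,Alcohol use among youth,10.0\n"
    "Wyoming,Some other question,99.0\n"
    "Montana,Alcohol use among youth,30.0\n"
    "Texas,Alcohol use among youth,25.0\n"
)

states = {
    "New York", "New Jersey", "Massachusetts",
    "Wyoming", "North Dakota", "Montana",
}
question = "Alcohol use among youth"

# Keep only the question and states of interest, converting the value to a number.
rows = [
    (row["LocationDesc"], float(row["DataValueAlt"]))
    for row in csv.DictReader(exported)
    if row["Question"] == question and row["LocationDesc"] in states
]
rows.sort(key=lambda pair: pair[1])  # sort by data value, lowest first

for state, value in rows:
    print(state, value)
```

With a real download you would replace the `io.StringIO` stand-in with `open()` on the exported file. As on the site, the filter values have to be spelled exactly as they appear in the data.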

### Mar

#### 2

# Excel for regression analysis: What a surprise!

Filed Under Software, statistics, Technology | Leave a Comment

I wouldn’t normally consider Excel for analysis, but there are four reasons I’ll be using it sometimes for the next class I’m teaching. First of all, we start out with some pretty basic statistics, I’m not even sure I’d call them statistics, and Excel is good for that kind of stuff. Second, Excel now has data analysis tools available for the Mac; years ago, that was not the case. Since my students may have Mac or Windows, I need something that works on both. Third, many of the assignments in the course I will be teaching use small data sets, and this is real life. If you are at a clinic, you don’t have 300,000,000 records. Fourth, the number of functions and ease of use of functions in Excel has increased over the years.

**For example,**

**TRANSPOSE AN ARRAY IN EXCEL**

Select all of the data you want and select COPY

Click on the cell where you want the data copied and select PASTE SPECIAL from the edit menu. Click the bottom right button next to TRANSPOSE and click OK. Voila. Data transposed.
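For comparison, the same operation outside of Excel is a one-liner. This is a small Python sketch with an invented two-row range, just to show what transposing does to the rows and columns.

```python
# Hypothetical worksheet range: each inner list is one row.
rows = [
    ["month", "Jan", "Feb", "Mar"],
    ["visits", 120, 95, 130],
]

# zip(*rows) pairs up the first cells, then the second cells, and so on,
# so each original column becomes a row.
transposed = [list(column) for column in zip(*rows)]

for row in transposed:
    print(row)
```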

**PERFORMING A REGRESSION ANALYSIS**

Once you have your data in columns (and if it isn’t, see TRANSPOSE above), you just need to

- Add the Analysis ToolPak. You only need to do this once and it should be available with Excel forever more. To do that, go to TOOLS and select EXCEL ADD-INS. Then click the box next to Analysis ToolPak and click OK.
- Now, go to TOOLS, select DATA ANALYSIS and then pick REGRESSION ANALYSIS

You just need to select the range for the Y variables, probably one column, select the range for the X variables, probably a column adjacent to it, and click OK. You may also select confidence limits, fit plots, residuals and more.
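If you want to check the slope, intercept and R-squared that Excel reports, the arithmetic for one X column is short. Here is a minimal Python sketch of ordinary least squares with a single predictor. The data are invented and `simple_regression` is a hypothetical helper name.

```python
import statistics

def simple_regression(x, y):
    """Ordinary least squares slope, intercept and R-squared for one predictor."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Invented example data: one X column and one Y column.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r_squared = simple_regression(x, y)
print(slope, intercept, r_squared)
```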

So, yeah, for simple analyses, Excel can be super-simple.

Believe it or not, this is what I do for fun. In my day job, I make video games that teach math and social studies.

You can check out the games we make here.

### Dec

#### 9

# Standardized testing: Solving your reliability problem

Filed Under computer games, statistics | Leave a Comment

One person, whose picture I have replaced with the mother from our game, Spirit Lake, so she can remain anonymous, said to me:

But there is nothing we can do about it, right? I mean, how can you stop kids from guessing?

This was the wrong question. What we know about the measure could be summarized as this:

- Students in many low-performing schools were even further below grade level than we or the staff in their districts had anticipated. This is known as new and useful knowledge, because it helps to develop appropriate educational technology for these students. (Thanks to USDA Small Business Innovation Research funds for enabling this research.)
- Because students did not know many of the answers, they often guessed at the correct answer.
- Because the questions were multiple choice, usually A-D, the students had a 25% probability of getting the correct answer just by chance, interjecting a significant amount of error when nearly all of the students were just guessing on the more difficult items.
- Three-fourths of the test items were below the fifth-grade level. In other words, even an average seventh-grader who got correct only the answers three or more years below grade level should have scored 75%, generally a C.

There are actually two ways to address this and we did both of them. The first is to give the test to students who are more likely to know the answers so less guessing occurs. We did this, administering the test to an additional 376 students in low-performing schools in grades four through eight. While the test scores were significantly higher (mean of 53% as opposed to mean of 37% for the younger students) they were still low. The larger sample had a much higher reliability of .87. Hopefully, you remember from your basic statistics that restriction of range attenuates the correlation. By increasing the range of scores, we increased our reliability.

The second thing we did was remove the probability of guessing correctly by changing almost all of the multiple choice questions into open-ended ones. There were a few where this was not possible, such as which of four graphs shows students liked eggs more than bacon. We administered this test to 140 seventh-graders. The reliability, again, was much higher: .86.

However, did we really solve the problem? After all, these students also were more likely to know (or at least, think they knew, but that’s another blog) the answer. The mean went up from 37% to 46%.

To see whether the change in item type was effective for lower performing students, we selected out a sub-sample of third and fourth-graders from the second wave of testing. With this sample, we were able to see that reliability did improve substantially, from .57 to .71. However, when we removed four outliers (students who received a score of 0), reliability dropped back down to .47.

What does this tell us? Depressingly, and this is a subject for a whole bunch of posts, that a test at or near their stated ‘grade level’ is going to have a floor effect for the average student in a low-performing school. That is, most of the students are going to score near the bottom.

It also tells us that curriculum needs to start AT LEAST two or three years below the students’ ostensible grade level so that they can be taught the prerequisite math skills they don’t know. This, too, is the subject for a lot of blog posts.

—-

*For schools who use our games, we provide automated scoring and data analysis. If you are one of those schools and you’d like a report generated for your school, just let us know. There is no additional charge.*

### Nov

#### 20

Last post I wrote a little about local norms versus national norms and gave the example of how the best-performing student in the area can still be below grade level.

Today, I want to talk a little about tests. As I mentioned previously, when we conducted the pretest prior to students playing our game, Spirit Lake, the average student scored 37% on a test of mathematics standards for grades 2-5. These were questions that required them to, say, subtract one three-digit number from another or multiply two one-digit numbers.

Originally, we had written our tests to model the state standardized tests which, at the time, were multiple choice. This ended up presenting quite a problem. Here is a bit of test theory for you. Variance in test scores is made up of two parts: true score variance and error variance.

True score variance exists when Bob gets an answer right and Fred gets it wrong because Bob really knows more math (and the correct answer) compared to Fred.

Error variance occurs when, for some reason, Bob gets the answer right and Fred gets it wrong even though there really is no difference between the two. That is, the variance between Fred and Bob is an error. (If you want to be picky about it, you would say it was actually the variance from the mean was an error, but just hush.)

How could this happen? Well, the most likely explanation is that Bob guessed and happened to get lucky. (It could happen for other reasons – Fred really knew the answer but misread the question, etc.)

If very little guessing occurs on a test, or if guesses have very little chance of being correct, then you don’t have to worry too much.

However, the test we used initially had four answer choices for each question. The odds of guessing correctly were 1 in 4, that is, 25%. Because students turned out to be substantially further below grade level than we had anticipated, they did a LOT of guessing. In fact, for several of the items, the percentage of correct responses was close to the 25% students would get from randomly guessing.
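To put a number on how much guessing can inflate a score, here is a small Python sketch of the classic correction for guessing. It assumes, simplistically, that students either know an answer or pick at random among four choices; `expected_score` and `estimate_known` are hypothetical helper names.

```python
def expected_score(proportion_known, n_choices=4):
    """Expected proportion correct when every unknown item is guessed at random."""
    return proportion_known + (1 - proportion_known) / n_choices

def estimate_known(observed_score, n_choices=4):
    """Invert expected_score: estimate the share of items actually known."""
    chance = 1 / n_choices
    return (observed_score - chance) / (1 - chance)

# The 37% mean on the pretest, under the pure-guessing assumption above.
known = estimate_known(0.37)
print(round(known, 2))
```

Under that assumption, a mean of 37% corresponds to students actually knowing only about 16% of the items, with the rest of the score coming from lucky guesses.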

When we computed the internal consistency reliability coefficient (Cronbach alpha) which measures the degree to which items in a test correlate with one another, it was a measly .57. In case you are wondering, no, this is not good. It shows a relatively high degree of error variance. So, we were sad.

**SAS CODE FOR COMPUTING ALPHA**

`PROC CORR DATA = mydataset NOCORR ALPHA ;`

`VAR item1-item24 ;`

`RUN ;`

The very simple code above will give you coefficient alpha as well as the descriptive statistics for each item. Since we very wisely scored our items 0 = wrong, 1 = right, a mean of, say, .22 would indicate that only 22% of students answered an item correctly.
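If you want to see what is under the hood, raw coefficient alpha is simple enough to compute by hand. Below is a minimal Python sketch of the standard formula, using invented 0/1 scores for five students on four items; `cronbach_alpha` is a hypothetical helper name, and this is an illustration of the formula, not a replacement for PROC CORR.

```python
import statistics

def cronbach_alpha(scores):
    """Raw coefficient alpha for a list of per-student item-score lists."""
    n_items = len(scores[0])
    item_variances = [
        statistics.variance([student[i] for student in scores])
        for i in range(n_items)
    ]
    total_variance = statistics.variance([sum(student) for student in scores])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)

# Hypothetical 0/1 item scores for five students on four items (invented).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(scores)

# With 0/1 scoring, each item mean is the proportion answering correctly.
item_means = [statistics.mean([student[i] for student in scores]) for i in range(4)]
print(round(alpha, 2), item_means)
```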

To find out how we fixed this, read the next post.

### Nov

#### 19

# Standardized Testing In Plain Words

Filed Under computer games, statistics | 1 Comment

I hate the concept of those books with titles like “something or other for dummies” or “idiot’s guide to whatever” because of the implication that if you don’t know microbiology or how to create a bonsai tree or take out your own appendix you must be a moron. I once had a student ask me if there was a structural equation modeling for dummies book. I told her that if you are doing structural equation modeling you’re no dummy. I’m assuming you’re no dummy and I felt like doing some posts on standardized testing without the jargon.

I haven’t been blogging about data analysis and programming lately because I have been doing so much of it. One project I completed recently was analysis of data from a multi-year pilot of our game, Spirit Lake.

Before playing the game, students took a test to assess their mathematics achievement. Initially, we created a test that modeled the state standardized tests administered during the previous year, which were multiple choice. We knew that students in the schools were performing below grade level but how far below surprised both us and the school personnel. A sample of 93 students in grades 4 and 5 took a test that measured math standards for grades 2 through 5. The mean score was 37%. The highest score was 63%.

Think about this for a minute in terms of local and national norms. The student, let’s call him Bob, who received a 63% was the highest among students from two different schools across multiple classes. (These were small, rural schools.) So, Bob would be the ‘smartest’ kid in the area. With a standard deviation of 13%, Bob scored two standard deviations above the mean.
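The arithmetic behind that last sentence, as a tiny Python sketch using the percentages reported above (`z_score` is a hypothetical helper name):

```python
def z_score(score, mean, sd):
    """Number of standard deviations a score lies above (or below) the mean."""
    return (score - mean) / sd

# Bob's 63% against the local mean of 37% and standard deviation of 13%.
bob = z_score(63, 37, 13)
print(bob)
```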

Let’s look at it from a different perspective, though. Bob, a fifth-grader, took a test where three-fourths of the questions were at least a year, if not, two or three, below his current grade level, and barely achieved a passing score. Compared to his local norm, Bob is a frigging genius. Compared to national norms, he’s none too bright. I actually met Bob and he is a very intelligent boy, but when most of his class still doesn’t know their multiplication tables, it’s hard for the teacher to get time to teach Bob decimals, and really, why should she worry, he’s acing every test. Of course, the class tests are a year below what should be his grade level.

One advantage of standardized testing is that, if every student in your school or district is performing below grade level, it allows you to recognize the severity of the problem and not think “Oh, Bob is doing great.”

He wouldn’t be the first student I knew who went from a ‘gifted’ program in one community to a remedial program when he moved to a new, more affluent school.

—