I was going to come back to the hotel after work and finish up my paper for the Western Users of SAS Software (WUSS) conference. This one is called Better-looking SAS for a Better Community. It was going to use the analyses of inequality in America that I did and presented to urban Los Angeles middle school students this past year.

However, another teacher asked me to do a presentation for his class, and he teaches in Missouri, in a school that is 95% African-American. The demographics being different, I thought a different presentation would be in order, so I am working on one called “A Study in Black and White”.

So far, it’s coming out well.

It was, anyway, until Erich, my former business partner, suggested I go to the pow-wow tonight. I weighed it: finish paper, go to pow-wow, Graph-N-Go, fry bread, SAS Enterprise Guide, Indian tacos, plots of predicted versus actual, fancy dancing.

In the end, the pow-wow won out. The picture below is from what I thought was the coolest part, the inter-tribal dance, where everyone from all of the tribes comes out and dances: men, women, kids, shawl dancers, grass dancers, jingle dress dancers, traditional dancers and dancers I don't know the name for.

My reasoning was, how many times in life will I get to go to a pow-wow? The last pow-wow I went to was probably 18 years ago. On the other hand, the last SAS program I wrote was last night.

This is my partner’s nephew, whose father used to be the tribal council chairman. When he’s not dancing at pow-wows he’s in law school. I suspect a few years from now federal cases on tribal sovereignty might be a bit more interesting.

Did I learn more from the pow-wow than doing my paper? Surprisingly, yes. I learned that the inter-tribal dance is gorgeous. I learned that there are more things in life than writing one more paper. Since I could not get Internet access, Cinch didn't work, Twitter didn't work, Facebook didn't work. I learned that sometimes it's good to quit recording things and just witness them.

Oh, and Indian tacos are not very much like Mexican tacos but they still taste delicious and that tripe soup looks just like menudo and there is no way in hell I am eating either one.



I'm getting my papers written for the 2011 Western Users of SAS Software (WUSS) meeting. All three papers are in the SAS Essentials strand, which is designed to give relatively new programmers a jump start in their careers. I've given some version of this talk a few times before, and I sometimes think it is pretty basic. I kind of want to do cool stuff with macros and parallel analysis criterion and multiple imputation, and, like the ads for barbells in the back of the comic book say to the ninety-three-pound weaklings, "Impress your friends!" (Or are those the Viagra ads? I get confused.)

Then I think of a very wise comment my former business partner and co-founder of Spirit Lake Consulting, Inc. once made to a new Ph.D. we had hired. He said,

I know the type of low-bandwidth, low-tech site we have is not going to impress your colleagues at the university you just graduated from, but guess what, we're not in business to impress those people. We're in business to serve a certain audience, and this is what meets that audience's needs.

I remember Erich's advice and think about the intended audience. These sessions are billed as suitable for novice programmers. I remember the first SAS conference I attended, SUGI 10, in Reno, NV. There weren't a lot of engineers who were single mothers of preschoolers back then. I called ahead and reserved a babysitter at the hotel. She took Maria to check out the elephants at the Golden Nugget, where the conference was held. Maria now covers the social media beat for ESPN. It was a long time ago. I still remember, though, wandering around and trying to find something that was not impossibly over my head.

So ... I think of me, Maria and the elephants as I get ready for each talk and wonder, "Is this really too basic?"

In case you’re dying to know, Part 2 of my SAS Essentials talk is on Data Triage and this is what I have to say.

Triage, as both the Merriam-Webster dictionary and any medical professional (of which I am not one) know, is "the sorting of and allocation of treatment to patients and especially battle and disaster victims according to a system of priorities designed to maximize the number of survivors".

I call these steps data triage because even if you don’t have time to do anything else, do this.

Learn to stare at your data. I'm serious. If there's ever a meeting where the results are completely wrong because of some fundamental flaw in the data, when the client is pointing his finger and screaming,

“Didn’t you even LOOK at these data?”

you don’t want to be the one at the end of that pointing finger.

Finding and fixing data disasters is a two-step process. You need to do both of these steps but if you don’t do the first, which I refer to as “data triage”, I just might come to your house and personally give you a good talking to. The first step uses PRINT, MEANS and FREQ procedures to quickly identify any glaringly obvious problems with your data. Here’s what to do and why you do it.

Print out the first 10 records.
PROC PRINT DATA = lib.g8_student07 (OBS = 10) ;
RUN ;

Don’t forget the OBS = 10 !  You don’t want to print out a dataset of 357,000 records! If something is completely off, it should show up in the first 10 records. You may be thinking that you should print more than 10, but remember, many of these datasets have 500 or more variables, and there are more efficient ways to analyze your data than looking at >5,000 numbers. At this point, you are just looking for glaring errors, like values entered for gender are “dog”, “tiger” and “pickle”. The most likely error you’ll spot here is that at some point in the INPUT statement an error was made and now all of the data are off by one column.

PROC MEANS DATA = lib.g8_student07 ;
RUN ;

At this point, you’re only looking for one thing and that is if values are out of range, for example, the items are scored on a 1 – 5 scale and the maximum of many of the variables is 8. Occasionally for perfectly good reasons other than to annoy you, people code data as 8, 99 or whatever to show that it was “not administered”, “not answered” , “did not know” and so on. There are some interesting analyses that can be done of patterns of missing data, but now is not the time to do them. If you want to use those missing value codes later, good for you, but given how many times these have caused completely incorrect results, you probably want to set these out-of-range values to missing in your analytic file. If you do need an analysis on non-respondents by type, you can always go back to the raw data file and read it in with the missing codes.
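As an aside, for readers who want the same recode-to-missing logic outside of SAS, here is a minimal sketch in plain Python (the record and item names are made up for illustration):

```python
# Hypothetical survey items scored 1-5, where 8 and 99 are
# "not administered" / "not answered" codes
MISSING_CODES = {8, 99}

def recode_missing(record, items):
    """Return a copy of the record with missing-value codes set to None."""
    cleaned = dict(record)
    for item in items:
        if cleaned[item] in MISSING_CODES:
            cleaned[item] = None
    return cleaned

row = {"id": 1, "q1": 3, "q2": 8, "q3": 99}
print(recode_missing(row, ["q1", "q2", "q3"]))
# {'id': 1, 'q1': 3, 'q2': None, 'q3': None}
```

Recoding into a copy, rather than in place, mirrors the advice above: keep the raw file with its missing codes intact, and set the out-of-range values to missing only in the analytic file.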

PROC FREQ DATA = lib.g8_student07 NOPRINT ;
TABLES idvar / OUT = studentfreq (WHERE = ( COUNT > 1 )) ;
FORMAT idvar ;
RUN ;

This is the step where you identify duplicate ID values. In almost every circumstance, you are going to have a unique identifier that is supposed to be, well, unique. This can be a social security number, transaction ID, employee number, whatever. Be sure you don't forget the NOPRINT option!!! If you happen to leave off the WHERE clause in the next statement or make some other error, the last thing you want is a frequency table of 2,000,000 user IDs printed out. The frequencies are going to be output to a file named studentfreq. Only values where the count is greater than 1 will be written to that dataset. Also, note that there are two sets of parentheses in the WHERE clause.

I want to force it to use the unformatted value, so I included a FORMAT statement with the variable name, followed by no format whatsoever.

The next step prints the first ten duplicate ID numbers.

PROC PRINT DATA = studentfreq (OBS = 10) ;
RUN ;

I used the OBS = 10 option because, just in case I used the wrong variable, forgot the COUNT > 1, accidentally typed COUNT = 1, or made one of a bunch of other errors that might have gotten me 2,000,000 records in this dataset, I don't want them all printing out.

If you are working with large datasets on a server, errors like this may put you over the disk quota for your account; if you are using SAS interactively, you will certainly run into problems.
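If you ever need the same duplicate-ID check outside of SAS, the logic is easy to sketch in plain Python with a Counter (the ID values here are invented):

```python
from collections import Counter

# Hypothetical student IDs; two values appear more than once
ids = ["A001", "A002", "A002", "A003", "A001", "A001"]

counts = Counter(ids)
# Keep only the IDs whose count is greater than 1, like the
# WHERE = (COUNT > 1) dataset option above
duplicates = {i: n for i, n in counts.items() if n > 1}
print(duplicates)  # {'A001': 3, 'A002': 2}
```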

There is a second option for getting a first look at your data, and that's using the CHARACTERIZE DATA task in SAS Enterprise Guide. You can find it under TASKS > DESCRIBE > CHARACTERIZE DATA.

This won’t give you the exact same results as above. What it will give you is a frequency distribution of the 30 most common values for each categorical variable and summary statistics (N, number missing, mean, minimum, maximum and median) for each numeric variable.

It also can give you graphs of each variable's distribution and save datasets of the descriptive statistics and frequencies. I don't usually do that. With datasets the size of the ones I use, with hundreds of variables, the graphs take a lot of time and don't provide much useful information.

The datasets can be useful. For example, I could open up the descriptive statistics dataset and sort by the number of missing observations to see which variables had a high percentage of missing data.

That, however, is getting to step two, which is more detailed analysis of your data quality.


Even though these steps are basic, I have a reason for emphasizing them year after year, and it is the exact same reason that I drilled my daughters on their multiplication tables: over-learning. Things you absolutely must know need to be repeated until you will never forget them. I haven't been in second grade for many decades and yet I still know that 12*12 = 144. These steps are that important. If you doubt me, just imagine that screaming client and his jabbing finger.

Tomorrow I may write about step 2 of data quality. Or maybe I’ll write about Part 1 of my talk. (Aren’t you even a little curious what comes before this?)

Or maybe I won’t do any of that, because Maria just flew in from Boston with my granddaughter who, coincidentally, is the exact same age Maria was when I attended that first SAS conference. I told you it was a long time ago.



We're one of those small businesses that Congress claims to care so much about. We're also, many years, a couple that makes over whatever the limit is for raising taxes. Whenever we see a tax credit, whether it's for child care, or the alternative minimum tax, or whatever, we just ignore it because we know we don't qualify. Even though The Julia Group is 100% Hispanic female owned, we don't qualify for Section 8(a) funding as a disadvantaged business because our household net worth is too high.

We pay A LOT of taxes. Corporate taxes. State taxes. Federal taxes. I even pay taxes to the Spirit Lake Dakota Nation, because we do business there.

So, I guess this is the part where I’m supposed to scream that I shouldn’t pay taxes and whine that I can’t create jobs because of taxes.

In North Dakota, the guys on the rodeo would tell each other to “Cowboy up”. I think that means quit being a whiner and do what is expected of you.

My former partner would tell his sons, “Be a man”. I think he meant the same thing.

I’ll just say, all of you people who are making over $200,000 a year, some of you making over $200,000,000 a year should just quit lying, quit being a dick and pay your taxes.
Here are some of the reasons I don’t complain about paying taxes:

1. My husband and I have six degrees between us. Five of those were from state institutions very heavily subsidized by taxes. Without that education, we wouldn’t be making this money.

2. I pay taxes to the Spirit Lake Dakota Nation because I do business there. The law enforcement, sanitation, roads department and every other service I use when I am there have to be paid for by someone. Why shouldn't I pay part of it? My income is many times that of the average resident on the reservation; why should they all pay and not me?

3. The resident rocket scientist works on government contracts. There aren’t a lot of rocket scientist wanted ads in the LA Times. Be glad he does, too. He’s brilliant.

4. I do a lot of government work. This year, it’s probably 50% of our business. My salary and corporate profits come in part from those taxes.

5. I live in a beautiful place, by the beach, in a safe neighborhood. Someone has to pay for the lifeguards, police, fire department, sanitation, libraries.

6. I use a lot of government resources. When doing research, I’m continually accessing data collected by the Census Bureau, the Department of Education or the National Institutes of Health or the National Science Foundation. The research they fund enables me to make money.


I know that all of these factors are true for me, and I find it impossible to believe they aren't even more true for someone making $200 million a year.

So quit lying about how government is the problem, quit being a dick, cowboy up and pay your taxes.




I have learned not to be too smart for my own good. Yesterday was an example.

My client provides many different types of services to the consumers who use their program. There are about 15 different options, from counseling to on-the-job training to assistive technology. We want to get the total number of services each participant in their program receives and the average number for the total customer base. However, there are fifteen different variables, and for each one the data are entered "Yes" if the customer received the service and "No" if they didn't. Of course, these data being entered by real human beings and not computers, the data are actually entered "Yes" or "yes" or "YES" or "Yes   ". You get the idea. And, of course, all of those yes's are interpreted differently by the computer.

My first (dumb) thought was to write two ARRAY statements, create 15 new variables, all numeric, and use a DO-loop to convert to 0 = No, 1 = Yes. Then I could sum those variables to get how many services each customer received. I could apply a PROC FORMAT so when the report printed out, it would print as "YES" and "NO".

There is a much easier way.
Data services ;
set in.consumers7_11 ;
Array nums{*} counsel -- Other_services ;
num_services = 0 ;
do i = 1 to dim(nums) ;
nums{i} = upcase(nums{i}) ;
if trim(nums{i}) = "YES" then num_services + 1 ;
end ;
run ;

Let’s look at each of these statements ….

Data services ;
set in.consumers7_11 ;

Well, obviously, the first one creates a data set. The second statement reads in data from a permanent SAS dataset.
Array nums{*} counsel -- Other_services ;

This creates an array of all of the variables in the dataset from the “counsel” variable to the “Other_services” variable. Conveniently for me, and for the data entry personnel, their dataset was set up so these were 15 consecutive variables.

num_services = 0 ;

This sets my num_services variable to 0. If you DON’T do this the value will be retained from one iteration of the data step to the next. Your first record will be correct, but for the second person you’ll get the number of services they received plus the number the first person received. Bad!

do i = 1 to dim(nums) ;
Obviously, this sets up the DO loop to go from the first variable to however many variables there are in the array. I don't know if there were exactly 15, and I was too lazy to count them.

nums{i} = upcase(nums{i}) ;

This changes all of the values to upper case.

if trim(nums{i}) = "YES" then num_services + 1 ;
This trims the trailing blanks for the values and, if the result is “YES”  (notice it is now upper case with no trailing blanks) the value for num_services is increased by one.

end ;

End the DO-loop and we’re done.
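For readers who don't live in SAS, the same normalize-and-count logic can be sketched in plain Python (the field names are invented for illustration):

```python
def count_services(record, service_fields):
    """Count the fields whose value, stripped and upper-cased, is YES."""
    return sum(
        1 for field in service_fields
        if str(record.get(field, "")).strip().upper() == "YES"
    )

# "Yes ", "yes", "YES" and "Yes   " all count; "no" does not
row = {"counsel": "Yes ", "ojt": "no", "assistive_tech": "YES"}
print(count_services(row, ["counsel", "ojt", "assistive_tech"]))  # 2
```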




Been working with SAS a lot lately to get the data in shape for a new client. Here’s one of the problems I ran across.

We want to get the average pay rate for people who got jobs as a result of our client's services. However, when one of the people they served did not get a job, the pay rate is listed as "no". When the person did get a job, it's listed as something like $11.00/hr.

Obviously, this is a character variable, and equally obviously, what I would like to do is:

  1. Change the “no” values to missing.
  2. Remove the “$”, “/” and “hr”
  3. Change the variable type from character to numeric

Why not just do a couple of Replace commands in Excel?

We have a contract with this client for the fiscal year and we expect, as with almost all of our clients, that this contract will be renewed and we'll work with them for years to come. I'd have to do that replace every month, unless I got them to change the way they enter their data, and I think it is bad service to ask our clients to change for my convenience. They're paying us to make life easier for them, not the other way around.

Here's how easy this is to fix:

Data services ;
set in.report7_11 ;
attrib jobpay length = 8 ;
jobpay = compress(earnings,'$/hrno') ;
run ;

The ATTRIB statement creates a new variable that is numeric. The COMPRESS function strips every character in '$/hrno' out of the value of the earnings variable, so "$11.00/hr" becomes "11.00" and "no" becomes nothing at all, which ends up as a missing value when SAS converts the result to numeric.
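The same cleanup is just as short in most languages. Here is a sketch of the equivalent in plain Python (the function name is mine, not part of anyone's API):

```python
def pay_rate(raw):
    """Strip the characters $, /, h, r, n, o and convert what's left.

    A value of "no" loses all of its characters, so it comes back as
    None, the equivalent of SAS's missing value.
    """
    cleaned = "".join(ch for ch in raw if ch not in "$/hrno")
    return float(cleaned) if cleaned else None

print(pay_rate("$11.00/hr"))  # 11.0
print(pay_rate("no"))         # None
```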




When people ask me what type of statistical software to use, I run through the advantages and disadvantages, but always conclude,

"Of course, whatever you choose is going to give you the same results. It's not as if you're going to get an F-value of 67.24 with SAS and one of 2.08 with Stata. Your results are still going to be significant, or not, and the explained variance is going to be the same."

There are actually a few cases where you will get different results and last week I ran across one of them.

A semi-true story

While I was under the influence of alcohol/drugs that caused me to hallucinate about having spare time during the current decade, I agreed to give a workshop on categorical data analysis at the WUSS conference this fall. After I sobered up (don't you know, all of my saddest stories begin this way), I realized it would be a heck of a lot easier to use data I already had lying around than go to any actual, you know, effort.

I had run a logistic regression with SPSS with the dependent variable of marriage (0 = no, 1 = yes) and independent variable of career choice (computer science or French literature). There were no problems with missing data, sample size or quasi-complete separation because, like all data that have no quality issues, I had just completely made it up. I thought I would just re-use the same dataset for my SAS class.

So, here we have the SPSS syntax



/CONTRAST (Career)=Indicator

/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

As I went on at great boring length in another post, if you take e to the parameter estimate B (Exp(B), in other words), you get the odds ratio for computer scientists getting married versus French literature majors, which is 11 to 1.

Also, I don’t show it here but you can just take my word for it,

the Cox & Snell R-square was .220 and the Nagelkerke R-square was .306.

If you are familiar with Analysis of Variance and multiple regression, you can think of these as two different approximations of the R-squared and read more about pseudo R-squared values on the UCLA Academic Technology Services page.

So, I run the same analysis with SAS, same data set and again, I just accept whatever the default is for the program.
proc logistic data = in.marriage ;
class cs ;
model married = cs / expb rsquare ;
run ;

If you look at the results, you see there is an

R-squared value of .220 and something called a Max-rescaled R-squared of .306

Okay, so far so good, but what is this?

Output from logistic regression

For the parameter estimates, for both the intercept and our predictor variable, we get completely different values; in fact, the relationship with career choice is NEGATIVE. But the Wald chi-square and significance level for the independent variable are exactly the same. (This is what we care about most.)

The odds ratio is different, but wait, isn't this just the inverse? That is, .091 is 1/11, so SAS is just saying we have 1:11 odds instead of 11:1.

Difference number 1: By default, SAS models the lower value, for example, NOT being married.

That’s easy to fix. I do this:

Title "Logistic - Default Descending" ;
proc logistic data = in.marriage descending ;
class cs ;
model married = cs / expb rsquare ;
run ;

This is a little better. The two R-squared values are still the same, the odds ratio is now the same, and at least the relationship between the CS variable and marriage is now positive. You can see the results here, or the most relevant table is pasted below if you are too lazy to click or you have no arms (in which case, sorry for my insensitivity about that, and if you lost your arms in the war, thank you for your service <-- unlike everything else in this blog, I meant that.)

Output with DESCENDING option on PROC LOGISTIC statement

BUT, the parameter values are still not the same as what you get from SPSS, and Exp(B) still does not equal the odds ratio.

Since actual work is calling me, I will give you the punch line thanks very much to David Schlotzhauer of SAS,

“If the predictor variable in question is specified in the CLASS statement with no options, then the odds ratio estimate is not computed by simply exponentiating the parameter estimate as discussed in this usage note:


If you use the PARAM=REF option in the CLASS statement to use reference parameterization rather than the default effects parameterization, then the odds ratio estimate is obtained by exponentiating the parameter estimate.  For either parameterization the correct estimates are automatically provided in the Odds Ratio Estimates table produced by PROC LOGISTIC for any variable not involved in interactions.”

So, the SAS code to get the exact same results as SPSS is this (notice the PARAM = ref option on the class statement)

Title "Logistic Param = REF" ;
proc logistic data = in.marriage descending ;
class cs / param = ref ;
model married = cs / expb rsquare ;
run ;

You can see the output here.

Did you notice that the estimate with the PARAM = REF (the same estimate as SPSS  produces by default)  is exactly double the estimate you get by default with the DESCENDING option? That can’t be a coincidence, can it?

If you want to know why, read the section on odds ratios in the SAS/STAT User Guide section on the LOGISTIC Procedure. You’ll find your answer at the bottom of page 3,952  (<— I did not make that up. It’s really on page 3,952 ).
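The short version: it is not a coincidence. With SAS's default effect coding, the two levels of CS are coded +1 and -1, so the log-odds difference between the groups is twice the parameter estimate; with reference coding they are coded 1 and 0, so the difference is the estimate itself. A quick arithmetic check in Python, using the 11:1 odds ratio from this made-up dataset:

```python
import math

odds_ratio = 11.0                    # CS majors vs. French lit majors

b_reference = math.log(odds_ratio)   # estimate under PARAM=REF (and SPSS)
b_effect = b_reference / 2           # estimate under default effect coding

print(round(math.exp(b_reference), 2))   # 11.0 -- Exp(B) IS the odds ratio
print(round(math.exp(b_effect), 2))      # 3.32 -- Exp(B) is NOT the odds ratio
print(round(math.exp(2 * b_effect), 2))  # 11.0 -- doubling recovers it
```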




What must everyone know and why must everyone know that?

This was one of the two central questions in Dr. J.T. Dillon's course on Curriculum and Instruction, which I thought was going to be a colossal waste of time. After all, I was going to be a statistician; why would I need a course on curriculum, the most boring topic on earth? As with most of the courses I thought would be a waste of time and which the university in its infinite wisdom required me to take, they were right and I was wrong. (Why would someone who was planning a career as a professor need to know something about teaching? Gee, I can't imagine!)

Usually when I am asked if I'd be available to teach a course, I say, "No", partly because it doesn't pay that well but mostly because I usually am NOT available unless you ask at least six months in advance. Well, they did. I haven't taught a graduate statistics course in a few years, so I'm really looking forward to it. Most of these doctoral students will be writing a dissertation and then, for the rest of their careers, reading research and making decisions based on their evaluation of that research. None of them are mathematics or statistics majors, and none are planning to do a lot of scientific research themselves. The question, then, is what do they need to learn?

1. Descriptive statistics, distributions and data visualization – In analyzing their own data they need to get a feel for it. They need to understand what the average person is like, the variance among their population, and identify the outliers.

2. Correlation and regression – A basic understanding and application of statistics is the knowledge of relationships, how you measure relationships, how you interpret them.

3. Group mean differences – Yes, mathematically you can couch this as a regression problem, and they should probably understand a little about that. They definitely need to know how to compare groups.

4. Hypothesis testing – Understand the difference between statistical significance and practical significance. This is usually an intuitive concept when teaching physicians because they are familiar with the idea of something being clinically significant. Other professions don’t always get this so easily. To understand statistical significance, you need to know something about probability. To understand probability, it helps to know something about combinations and permutations.

5. How to analyze categorical data – Sometimes the world really does fit in neat little boxes. You have never had cancer, you have cancer now or you are in remission. You're married or you're not (and no, that weekend in Cabo doesn't count, unless you were visited by the Holy Ghost). For those times you need to understand chi-square and logistic regression. Maybe a little bit about odds ratios.

6. How to use some kind of software to compute results – It would be nice to have minions to do this for you. If you have the budget you can hire someone like me. Sadly, “graduate student” is one of the lower paid occupations, so it behooves the students to learn how to do at least some of the analysis themselves.

So, there's my syllabus. I still don't have a textbook. I'm debating The Statistical Sleuth, supplemented by some other resources.

Anyone else have any ideas for topics, exercises or readings, please dive in.

I’m really looking forward to this semester. It’s going to be great!



Two incidents happened lately where results I expected to be the same turned out to be different. The first one involved two Excel files sent to me from one of my favorite clients. (You’re all my favorite clients!)

One of the files was an Excel file on a Mac. The other was created on a PC. Then, both files were opened on a Mac and the rows from one file pasted into the other. It all looked lovely when I got it except for the fact that many of the dates were for 2013 and beyond, which didn’t make sense for, say, the date an application was received.

If you use Excel a great deal, you might already be aware that Macintosh and Windows may use different calendars for dates. The default for the Macintosh is to start with January 1, 1904. The default for Windows is January 1, 1900. If you paste data from a Macintosh file into a PC file, you can end up with dates that are off by four years and one day.

You can correct this in Excel. Microsoft gives a solution here:

You can correct shifted dates by following these steps:

  1. In an empty cell, enter the value 1462.
  2. Select the cell. On the Edit menu, click Copy.
  3. Select the cells that contain the shifted dates. On the Edit menu, click Paste Special.
  4. In the Paste Special dialog box, click Values. Then, select either of the following option buttons.
       Select this     If
       Add             The dates must be shifted up by four years and one day
       Subtract        The dates must be shifted down by four years and one day
  5. Click OK.

Repeat these steps until all of the shifted dates have been corrected.
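Where does the 1462 come from? The Windows date system effectively numbers days from December 30, 1899 (because of Excel's phantom February 29, 1900), while the Macintosh system numbers them from January 1, 1904, and those two day-zeros are 1,462 days apart. You can check that with Python's datetime module:

```python
from datetime import date

# Effective day zero of Excel's Windows ("1900") date system,
# pushed back two days by the phantom Feb 29, 1900 leap-year bug
windows_epoch = date(1899, 12, 30)
# Day zero of Excel's Macintosh ("1904") date system
mac_epoch = date(1904, 1, 1)

print((mac_epoch - windows_epoch).days)  # 1462
```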

As Wade Lasson noted on his blog, that takes WAY too many steps if you have 60,000 dates and it is not just as simple as changing the calendar you are using.

In my case, I couldn’t just go to Excel on my Mac, select PREFERENCES , select CALCULATIONS and click off the 1904 Calendar box because then the other half of the records (that came from the PC) would be off by four years and one day in the opposite direction.

Here is one really quick way to fix it using SAS.

First, figure out which of the variables are dates by looking at the output from a PROC CONTENTS.

Second, assuming, as I did, that you have some way of knowing which file had the dates that were off, do this:

DATA newdates ;
SET oldfile ;
ARRAY fix{*} list of date variables ;
IF site = "Macintosh Site" THEN DO i = 1 TO DIM(fix) ;
fix{i} = fix{i} + 1462 ;
END ;
RUN ;


What if you have hundreds of date variables?

I hate to type. If you have hundreds of date variables, do this:

PROC CONTENTS DATA = datasetname OUT = test NOPRINT ;
PROC PRINT DATA = test ;
VAR name ;
WHERE INDEX(format,"DATE") > 0 ;
RUN ;

PROC CONTENTS won't take a VAR or WHERE statement itself, so the trick is to output the contents to a dataset and print the variable names from that. This will print the list of date variables. Copy and paste them after the ARRAY statement. That is all.

Oh, don’t whine because you had to spend 5 seconds scrolling down to copy something.




I'm working on the SAS Naked Mole Rat book and was writing about residual variance. It occurred to me that I had just thrown in a chart, assumed that people would know what residual variance was, and moved right along. This reminds me of the time the rocket scientist had a sign outside of his office that said,

“Heisenberg may have slept here.”

He was surprised that no one laughed. He said,

“I’d think everyone would have heard of the Heisenberg Uncertainty Principle.”

Uh, no.

Below you can see where I plotted the residuals by the predictor (pretest score) for a dichotomous variable, passed course, coded as yes or no.

Scattergram of residuals by pretest score

Nice graph. What does it mean, exactly? Let's start with what a residual is. The residual is the actual (observed) value minus the predicted value. If you have a negative value for a residual, it means the actual value was LESS than the predicted value. The person actually did worse than you predicted. If you have a positive value for the residual, it means the actual value was MORE than the predicted value. The person actually did better than you predicted. Got that?

It’s worth belaboring this point because I think it is a bit counter-intuitive. Or maybe it’s just me. So, look at the line above, which is at zero. If there is a residual error of zero it means your prediction was exactly correct. Under the line, you OVER-predicted, so you have a negative residual. Above the line, you UNDER-predicted, so you have a positive residual. Naw, it isn’t just me. That really is counter-intuitive, but that’s the way it is. (Ryan Rosario, on twitter, source of all knowledge second only to wikipedia (twitter in general, that is, not Ryan specifically, although he does seem to be pretty smart), pointed out that one way of thinking of it is that a positive residual error means your actual result is OVER the predicted one, that you OVER-achieved. So, it may make sense in a twisted way.)
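In code, a residual really is nothing more than observed minus predicted. A tiny sketch in Python, with made-up observed pass/fail outcomes (1 = passed) and made-up predicted probabilities:

```python
# Hypothetical observed outcomes (1 = passed, 0 = failed) and the
# model's predicted probabilities of passing
observed  = [1, 0, 1, 1, 0]
predicted = [0.35, 0.20, 0.90, 0.55, 0.70]

# Positive residual: did better than predicted; negative: did worse
residuals = [round(obs - pred, 2) for obs, pred in zip(observed, predicted)]
print(residuals)  # [0.65, -0.2, 0.1, 0.45, -0.7]
```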

Back to the graph: the x-axis is labeled "z_total_pre". It is the z-score for the total on the pre-test. As you might remember from your first statistics course, a z-score is a standardized score where 0 is the mean and 1 is the standard deviation. So, a person with a z-score of 0 is exactly average, a negative z-score is below average and a positive z-score is above average. Assuming a normal distribution, a person with a z-score of +1.0 scores above 84% of the population, that is, in the top 16%. A person with a z-score of -1.0 is in the bottom 16%. A person with a z-score of +2.0 is in the top 2.5% and with a -2.0 z-score in the bottom 2.5%.
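Those percentages come straight from the standard normal distribution, and since Python 3.8 you can check them with the statistics module:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

print(round(z.cdf(1.0), 3))   # 0.841 -- a z of +1.0 beats about 84%
print(round(z.cdf(-1.0), 3))  # 0.159 -- bottom 16%
print(round(z.cdf(2.0), 3))   # 0.977 -- top 2.5%, give or take
```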

If you have an unbiased predictor, you should be equally likely to predict too high or too low. The mean of your error should be zero. It should also be zero at all points along the predictor. No one is going to be too happy if you say that your predictor isn't very good for "A" students. However, it seems that is exactly what's happening. In fact, past about 1.5 standard deviations above the mean, ALL of our students have been over-predicted. Let's look at our graph again. I have helpfully pasted it below again so that you don't have to scroll all the way back. That would be annoying. Besides, I get paid by the page.

Scattergram of residuals by pretest score
We have two students who scored very low on the pretest, around -2.5 standard deviations. We used their pre-test scores to predict whether the student would pass or fail. One of them had an actual score greater than predicted – a residual of about .65; the other had an actual score less than predicted, with a residual error of around -.2.

As you can see, past a z-score of about 1.5 all of the residual values are negative. That means that for the top 10% or so of the students, we had an actual value that was less than the predicted value. So, this is NOT an unbiased predictor. It is biased in favor of the high-scoring students. I am bound by the Statistician’s Code to consider bias bad.
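One simple way to put a number on that kind of bias is to take the mean residual among just the high scorers. A sketch with invented (z-score, residual) pairs, not the real study data:

```python
# Invented (z_total_pre, residual) pairs, shaped like the pattern in the graph.
pairs = [(-2.5, 0.65), (-2.5, -0.2), (0.0, 0.1),
         (0.8, -0.05), (1.6, -0.3), (2.0, -0.4)]

# Mean residual for students more than 1.5 SDs above the mean on the pre-test.
high = [res for z, res in pairs if z > 1.5]
mean_high = round(sum(high) / len(high), 2)

# For an unbiased predictor this should hover around zero; a clearly
# negative value means the top scorers are systematically over-predicted.
print(mean_high)   # -0.35
```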

Let’s think about residual error in a hypothetical situation. Let’s say we have predictions for males and females and that the females tend to have their performance over-predicted (just like the high scorers in our example above). Let’s say that based on your score, you, a male, were denied admission to the program. Could you argue, based on statistical evidence, that the selection test was biased in favor of women? Why, yes, you could.
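The same check works for groups instead of score ranges. A sketch with invented residuals (remember the sign convention: over-predicted means negative residuals):

```python
# Invented residuals by sex; females over-predicted (negative residuals),
# males under-predicted (positive residuals).
residuals = {"F": [-0.4, -0.3, -0.5], "M": [0.2, 0.3, 0.1]}

# Mean residual per group; a gap between the groups is the statistical
# footprint of the bias argued in the text.
group_bias = {g: round(sum(r) / len(r), 2) for g, r in residuals.items()}
print(group_bias)   # {'F': -0.4, 'M': 0.2}
```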

Bottle of wine

There are other things I could say about residuals, but the rocket scientist just came home, bearing groceries and a bottle of wine. Maybe I will go drink Chardonnay and talk to him about Heisenberg.




Planets at Griffith Observatory

Yesterday, I was at Griffith Observatory with the world’s most spoiled 13-year-old. We were there because she is in a Summer Scholars program that is designed to provide academic enrichment for high-achieving girls. Unfortunately, not being as high-achieving as we might like, she got a B- on her latest test, and given that the teacher had offered extra credit to anyone who went to the observatory and came back with a star map, off we went. I presume there was some assumption that you would look around the observatory and not just pick up the map at the gift shop, eat in the cafe and head for home, which was the original plan of The Spoiled One, foiled by yours truly.

So, we walked around the observatory for a while, stood on the different scales to see our weight on different planets, discussed differences in gravitational pull, Galileo, Kepler, Michelangelo and Da Vinci. I didn’t know who painted the Mona Lisa but I did know it wasn’t Michelangelo. She thought it was Da Vinci, looked it up on the wikipedia app on my iPhone and sure enough, she was right. I knew that Galileo was the one who had the correct idea about the earth rotating around the sun. I did not know what Kepler was incorrect about but I vaguely remembered he was important for some reason, so I figured he must have had some views that turned out to be correct (it’s been nearly 40 years since I took high school science, give me a break). When we got home, there was a discussion with the resident rocket scientist on Kepler’s laws, including his disproven third law.

“I knew it!”

She was profoundly satisfied to have been correct twice in one afternoon.

This summer, my daughter is studying plate tectonics, constellations, and the Renaissance scientists and artists. This morning, she was discussing with her father the chemical reaction that occurs when a sparkler, uh, sparkles, and asking for the correct spelling of the word “oxidizer”. Don’t get me wrong, she complains BITTERLY about not spending every waking moment of her summer at the beach and the mall. Just last night, she got a new computer game, “Sims go to the moon for pizza” or something like that, because she had been doing so much extra work.

BUT …. I contrasted this with some of the cultural programs I have evaluated. In some of these, the students went camping in the woods for a week or two. Truly a fun thing to do. They watched movies about famous Native Americans or about important events in Native American history. They had lessons in their native language.

I’m not saying none of that is of value. When I was young, my mom (who has a Botany minor), would go camping with us and tell us things like “That flower is called Baby’s Breath. They use it in bouquets at weddings a lot. This is milkweed. You’ll find Monarch butterflies around it.”

Did learning (and forgetting) about Kepler help me any more in my life than knowing about milkweed? Obviously not. Although since I haven’t been camping in over 30 years, it probably didn’t help me any less, either.

Actually, to be honest, information like what my daughter is learning did help me, because it was on the standardized tests that helped me get scholarships and admitted to college. The more that programs for kids, like the Summer Scholars program, teach information that shows up on the tests, the better those kids will do on the tests. Duh! This isn’t one of those rants about how bad “teaching to the test” is, either. Lately, she’s asked me what the word “enigmatic” meant and why people say Mona Lisa had an enigmatic smile, and did I really think Michelangelo was gay because he sculpted a lot of naked men, because she read that on some website. We’ve gone out to the park, lain in the grass at 10:30 pm, looked at the stars and talked about constellations and light pollution. She’s doing more than memorizing facts to regurgitate.

I’m not saying cultural programs are bad, either. In particular, I remember an outstanding program done on the Spirit Lake Nation several years ago where the students spent the summer testing water quality at the lake, learning how to analyze water samples, measuring the water level and graphing trends over years. You can have programs that both address issues of historical and traditional value to your culture – like caring for the environment, how a treaty that sets the lake as a reservation boundary can have significant implications – and be a really kick-ass educational program at the same time. What I AM saying, though, is that programs like that don’t happen by accident and that it requires some creativity, an in-depth knowledge of the content areas like science and history, and most importantly, a deliberate effort to blend students’ culture into the curriculum, rather than replacing academic subjects already being taught.

Because, as my daughter is constantly complaining, there are only so many days in the summer.

Me being immature at the observatory

You are here

