It’s that time of year again when we hear complaints about how terribly the U.S. is doing in math. This article by The Atlantic with the title American Schools vs. the World: Expensive, Unequal, Bad at Math is just one of many, many reports that showed up in my Twitter stream.
The first question anyone who has had more than a week of statistics should ask is about the sampling. Didn’t anyone notice that the top “country” is not a country at all but a city? The article goes on to say “parts of China, Japan, Korea and Liechtenstein topped the ratings… ”
Wait, what? PARTS of China?
As a statistician, I am intensely interested in just how representative these parts of China might be. Since many people in the U.S. didn’t bother to investigate, I tracked down this information on the demographics of the sample from an article written for The Guardian by a correspondent in Beijing:
Its population is less than 2% of the country’s total, and its per capita GDP is more than twice the national average. According to Tom Loveless, an expert on education policy at Harvard University, 84% of its high school graduates enroll in college, compared with 24% nationwide.
I’m not saying that U.S. education could not use improvement. (Although the more I work with the Common Core Standards for mathematics, the more I’m convinced they have been over-hyped.) What I am saying is that calls for improvement should be based on well-reasoned arguments that show some understanding of science – random sampling being a good start. It is also worth noting that the same article in The Guardian mentioned that 12 provinces in China had students take the test but only the results for one CITY were released.
There were also several articles that discussed Finland “slipping” – its math scores dropped 7 points, from 548 to 541. A lot of potential explanations and remedies were given in this Washington Post blog; sadly, none of them mentioned regression towards the mean, sampling error or the nature of the test itself. There is also the question of how much, in absolute terms, that difference really means. Children don’t answer 600 questions. They answer a test with 50 or 60 questions, which is then normed to, say, a mean of 500 with a standard deviation of 100. That drop may well represent the average child answering 43.7 questions correctly last year and 43.2 questions correctly this year.
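To make that last point concrete, here is a rough back-of-the-envelope calculation in Python. The raw-score standard deviation of 7 questions is invented purely for illustration – it is not PISA’s actual scaling – but it shows how a 7-point drop on a mean-500, SD-100 scale can amount to about half a question.

```python
# Back-of-the-envelope: how big is a 7-point drop on a normed scale?
# The raw-score SD of 7 questions is an assumption for illustration,
# NOT the actual PISA scaling.

def scaled_to_raw_change(scaled_drop, scaled_sd=100, raw_sd=7.0):
    """Convert a change on the normed scale (e.g. mean 500, SD 100)
    back into raw questions, assuming the raw test SD is raw_sd questions."""
    return scaled_drop / scaled_sd * raw_sd

drop_in_questions = scaled_to_raw_change(548 - 541)
print(round(drop_in_questions, 2))  # 0.49 - about half a question on a 50-60 item test
```

In other words, under these assumed numbers, the headline-generating national decline is roughly one child in two answering one more question incorrectly.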
Then, there is the host of articles that went on about how the sky is falling because countries that have low PISA scores also do poorly economically, and therefore China is going to eat our lunch. Don’t even get me started on correlation and causation.
All of this has led me to conclude that the PISA data are unclear on whether or not American (and Finnish) children are really doing that terribly in math, but it has led me to the firm conclusion that most of the journalists’ articles I read could certainly use a refresher in statistics. (See how I did not generalize to all journalists in the entire world? Take note.)
In fairness, the very same Washington Post had another blog on how public opinion is being manipulated using the PISA findings. Is it or is it not an outlier? Discuss.
This month, I’m teaching biostatistics for National University, and so far I am really enjoying it. There is just one really minor problem, though. While I received a copy of the textbook, I did not receive a copy of the instructor’s manual with answers to the homework problems. Since I am going to grade 20 people based on whatever I get, I need to be 100% correct in everything, and it is taking up my time to compute cumulative incidence for the population, cumulative incidence for people with hypertension, population attributable risk – and I am busy.
So, check this out, and all of you epidemiologists, I am sure this is old hat to you …. I had a table that gave me the number of people who were and were not hypertensive and whether or not they had a stroke in the five years they were followed. I wanted cumulative incidence for those with hypertension, those without and the population attributable risk.
And here we go …..
DATA stroke ;
   INPUT Event_E Count_E Event_NE Count_NE ;
   DATALINES ;
18 252 46 998
;
RUN ;

PROC STDRATE DATA = stroke REFDATA = stroke STAT = RISK ;
   POPULATION EVENT = Event_E TOTAL = Count_E ;
   REFERENCE EVENT = Event_NE TOTAL = Count_NE ;
RUN ;
All I need to do is create a data set where I give the number of people who were exposed (in this case, who had hypertension) who had the event, a stroke in my example, and the total number of exposed people. Then, the number not exposed (that is, not hypertensive) who had the event, and the total number not exposed.
I just invoke PROC STDRATE, giving it the name of my dataset and specifying that I want risk as the statistic.
In my POPULATION statement, I specify that for the population of interest, people with hypertension, the number who had the event was found in the variable Event_E and the total number was in Count_E .
In my REFERENCE statement, I give the number who had the event and the total number for people who were not exposed to the risk factor.
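As a sanity check on whatever the procedure reports, these particular statistics are simple arithmetic, so here is the same calculation sketched in Python (Python rather than SAS, purely as an illustration), using the counts from the data step above:

```python
# Same 2x2 numbers as the SAS example: 18 strokes among 252 people with
# hypertension, 46 strokes among 998 people without, over five years.
event_e, total_e = 18, 252      # exposed (hypertensive)
event_ne, total_ne = 46, 998    # not exposed

ci_exposed = event_e / total_e                        # cumulative incidence, exposed
ci_unexposed = event_ne / total_ne                    # cumulative incidence, unexposed
ci_population = (event_e + event_ne) / (total_e + total_ne)

# Population attributable risk: incidence in the whole population
# minus incidence among the unexposed.
par = ci_population - ci_unexposed

print(f"CI exposed:    {ci_exposed:.4f}")    # 0.0714
print(f"CI unexposed:  {ci_unexposed:.4f}")  # 0.0461
print(f"CI population: {ci_population:.4f}") # 0.0512
print(f"PAR:           {par:.4f}")           # 0.0051
```

Handy for checking your homework key when the instructor’s manual never arrives.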
I’m looking forward to teaching my first masters level course in a lo-o-ng time next week. Since this may be the first course students take in their masters program, the question I’m faced with is,
“What would you tell someone at the very beginning of learning about statistics?”
I’m starting with this:
Bias = bad
Bias is to statisticians as sin is to preachers. We’re against it.
Bias is SYSTEMATIC error. While it is generally impossible to avoid error, in an unbiased study, error will be random.
Random = good
If error is random, we would be equally likely to err in one direction as the other, and so, on the average, would get the correct result. For example, if I was evaluating fighters to decide if they really did have brain damage as a result of being hit in the head too many times, in some borderline cases I might incorrectly decide the fighter was fine when, in fact, there was some minimal brain damage. In other cases, I might decide the person had damage, when he or she was just somewhat on the low side of the bell curve in terms of functioning brain cells. On the average, though, those errors should balance out and I should get the correct conclusion.
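Here is a small simulation of that idea, with entirely invented numbers: random measurement error washes out in the mean, while a constant bias does not.

```python
# Simulation: random error averages out, systematic error (bias) does not.
# All distributions and the 3-point bias are invented for illustration.
import random

random.seed(1)
true_scores = [random.gauss(100, 15) for _ in range(100_000)]

# Random error: equally likely to err up or down.
noisy = [s + random.gauss(0, 5) for s in true_scores]

# Biased measurement: always reads about 3 points too high.
biased = [s + 3 + random.gauss(0, 5) for s in true_scores]

mean = lambda xs: sum(xs) / len(xs)
print(round(mean(true_scores), 1))  # close to 100
print(round(mean(noisy), 1))        # still close to 100: errors cancel
print(round(mean(biased), 1))       # about 3 points high: bias persists
```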
Random assignment is good because it means that people are equally likely to be assigned to one group versus another, so it is likely to control for confounding variables. What are confounding variables? Those are factors that may have complex relationships that distort the relationships found between your predictors/ risk factors and outcome variables. For example, people residing in nursing homes (my predictor) may be more likely to die (my outcome) but that might be because they are older or in poorer health (confounding variables).
Random selection is good because it means that everyone in the population has an equal chance to be selected, which means that, if you have a large enough sample, your sample is likely to be representative.
What’s a sample? What’s a population? What’s representative?
Well, we’ll get into that shortly.
But, speaking of random, I thought the most important thing to begin with was not how to find a mean or standard deviation but that bias is bad, because if you have bias, you are worse off after you found the mean than before you knew how to compute it. Before, you didn’t have any information: you didn’t know the mean, and you knew you didn’t know it.
With bias, you still don’t know the mean, but you think you do. You’ve actually gone backwards.
Think about it.
Why do we still teach systematic random sampling as an option?
As you may recall from your Statistics 101, simple random sampling is when you select from the population at random. So, if you want 100 people out of a population of 10,000 in a dataset, you would pull a random sample by, most likely, using a random number function.
In a systematic sample, you select the first number at random. Say you want 1% of the population, like our 100 out of 10,000 example. You would select a number between 1 and 100 and then you’d select every 100th person after that. So, if your number was 98, you’d select person #98, 198, 298 and so on.
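Both schemes are only a couple of lines of code, which is part of my point. Here is a sketch in Python with a made-up population of ID numbers:

```python
# Simple random vs. systematic sampling of 100 people out of 10,000.
# The population is just made-up ID numbers, for illustration.
import random

population = list(range(1, 10_001))  # person #1 through #10,000
k = 100                              # desired sample size
interval = len(population) // k      # every 100th person

# Simple random sample: a random number function picks 100 IDs.
random.seed(42)
srs = random.sample(population, k)

# Systematic sample: random start between 1 and 100, then every 100th person.
start = random.randint(1, interval)
systematic = population[start - 1::interval]

print(len(srs), len(systematic))  # 100 100
# A start of 98 would give persons #98, #198, #298, and so on.
```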
The danger is always that your data may be in some sort of systematic order. For example, if you had collected students’ tests and entered them, you might have classes where teachers seated the students alternating boys and girls. So, you might get a whole bunch of boys or a whole bunch of girls.
Yes, systematic sampling is easier if you are pulling the data by hand, but given that almost all samples in any real-life use are pulled by a computer, why do we even teach systematic sampling? If you have a data set already created, I can’t think of any benefit to systematic random sampling. What, it takes the computer .0015 seconds instead of .0007 seconds?
The only possible benefit I could imagine was if you were sampling people as they came through the door of a clinic or shopping mall. In that case, it might be easier to start with a person at random and then badger every Nth person.
Even that doesn’t really make sense to me, though. It seems that in those settings, where you have a hard enough time getting people to stop and talk to you, you ought to just try to grab every person you can get.
Let’s say you are doing an observational study, though, of public behavior. In this case, you really could do a random sample and, theoretically, a systematic random sample would be much easier than trying to keep track of whether that person who walked by was #17 and the next was #134. Still, even this specific situation is no excuse. It would be really easy to write a program that makes your phone beep at random times, a minimum of some fixed duration apart (the time you plan on observing each person) and at each beep you could observe the person who was walking by, browsing in front of your store window, or whatever it is you’re observing. I’m positive those programs already exist so you wouldn’t even actually need to write it yourself.
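A sketch of what such a random-beep program might look like, in Python. The number of beeps, the shift length and the minimum gap are all made-up parameters:

```python
# Generate random observation ("beep") times over a shift, each at least
# min_gap seconds apart. All the parameters here are invented.
import random

def beep_times(n_beeps, shift_seconds, min_gap, seed=None):
    """Random times in [0, shift_seconds], consecutive times >= min_gap apart.
    Works by scattering n_beeps points over the slack left after reserving
    min_gap between each pair of consecutive beeps."""
    rng = random.Random(seed)
    slack = shift_seconds - (n_beeps - 1) * min_gap
    if slack <= 0:
        raise ValueError("shift too short for that many beeps")
    points = sorted(rng.uniform(0, slack) for _ in range(n_beeps))
    return [t + i * min_gap for i, t in enumerate(points)]

# 20 observations over a 4-hour shift, at least 2 minutes apart.
times = beep_times(n_beeps=20, shift_seconds=4 * 3600, min_gap=120, seed=7)
gaps = [b - a for a, b in zip(times, times[1:])]
print(min(gaps) >= 120)  # True: every beep at least two minutes apart
```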
It just seems to me like systematic random sampling is an idea whose time has gone the way of Roman numerals.
Hey, if you are a furloughed federal employee looking for something free to do on Thursday because you still haven’t received your back pay that Congress promised you, you can drop on into Salem, Oregon to the Oregon SAS Conference. I will be speaking on Categorical Data Analysis, Telling Stories with Your Data and How Factor Analysis is Your Friend. Not all at the same time – I’m giving three talks. They promised me that none of them would occur before 10 a.m.
The first one is pretty basic, pointing and clicking to do a little exploring of a small pilot study. If you are not feeling all warm and fuzzy about statistics, come to that one. On the other hand, if you really have been dying to have someone to discuss ROC curves and eigenvalues with, come to one of the other two. Even the more statistical presentations are pretty user-friendly, though. There will be lots of pictures. Also, beer.
The beer isn’t actually part of the free conference, but if you hang out with me afterwards I’m pretty certain we can find some.
While I love teaching and am looking forward to working in a completely new environment – teaching an online course to masters students – I was initially concerned that teaching a course on biostatistics in public health might draw too much time away from my work for The Julia Group. I really should have known better. Statistics is statistics, and as I’m reading the textbook preparing for my lectures in November, it is actually very relevant to the research design we are implementing now.
Randomized controlled trials are not an option for us. We work either with schools, testing our 7 Generation Games educational software, or with social service programs that are mandated to provide an intervention for those who qualify. On the other hand, the controlled part is very relevant to our work. For example,
“… investigators must be sure that participants are taking the assigned drug as planned and not taking other medications that might interfere with the study medications…”
You might wonder how this relates to educational research …
When we collect data on our games, we have a timestamp recorded with each answer. This was extremely useful when we were looking at the different classrooms that used our game, because some had better outcomes than others. We were able to estimate, from the timestamp on the first problem answered by a student in the class to the last, about how long students in that classroom had the opportunity to play the game. Not all were ‘taking the program as planned’. Some classes had math before lunch, while for one class math was the last period of the day; with early dismissals on several days because of hazardous weather (our initial testing occurred in North Dakota in the fall and winter), students in that last-period class only played the game about two-thirds as much as students in the other classes.
What was not mentioned in the text, but is equally important (and I’ll address it in lecture), is that whether it is a drug study or an educational intervention, you also need to be sure you assess your DEPENDENT variable correctly.
Recently, I was talking to someone from a school that used a mathematics program, a competitor to us, that had demonstrated very good results. The test scores at the school had risen dramatically, and yet, to my surprise, the math department was very interested in having us come install our program. I found out that in the year before they used the program, students were given a set amount of time for their standardized tests. The year they began using the program, students were allowed unlimited time to complete the same tests. They were even allowed to come back the following day and finish up where they had left off. The math teachers working with these students were very aware that the students’ skills needed major help, regardless of what the tests might say.
While this has to do with education and not health care, the same applies. If your dependent variable is blood pressure, cholesterol or blood sugar, you need to make sure that it was measured accurately AND UNDER THE SAME CONDITIONS, for both groups and at pretest and posttest.
I’m amused when people make comments like, “I’ve forgotten more statistics than you’ll ever know.”
Personally, I try NOT to forget, which is why teaching a masters level class every now and then is a good reminder of the basic principles. Just so you know that I practice what I preach, I’m on a flight to North Dakota right now where I will be meeting with principals and teachers of our intervention schools to discuss the importance of collecting data from all of the students at the same time in the same way.
Box and whisker plots can give you an understanding of your data at a glance – IF you know what you’re looking at.
The BOX extends from the 25th percentile to the 75th percentile. That line in the middle is the median, also known as the 50th percentile. The diamond inside the box is the mean. The whiskers, those two lines at either end, extend from the box as far as the minimum and maximum values, up to 1.5 times the inter-quartile range. The inter-quartile range is the distance from the 25th percentile to the 75th. In other words, each whisker MAY extend up to 1.5 times the length of the box. (Different software packages use different values for the whiskers. This is what SAS does.) If there are any outliers beyond 1.5 times the inter-quartile range, they’ll be shown as asterisks after the end of the whisker. In the t-test output, SAS also shades an area for the 95% confidence interval.
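If you want to see where the pieces come from, here is the arithmetic in Python on a small invented data set. (The quantile interpolation below is Python’s default, which may differ slightly from SAS’s.)

```python
# Computing the pieces of a box plot by hand: quartiles, IQR, and
# whiskers capped at 1.5 * IQR. The data are invented for illustration.
import statistics

data = [-6, -3, -1, 0, 1, 2, 2, 3, 4, 5, 6, 8, 11, 25]

q1, q2, q3 = statistics.quantiles(data, n=4)  # 25th, 50th, 75th percentiles
iqr = q3 - q1                                 # the length of the box

# Whiskers extend to the most extreme data points still within
# 1.5 * IQR of the box; anything beyond is plotted as an outlier.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
whisker_low = min(x for x in data if x >= low_fence)
whisker_high = max(x for x in data if x <= high_fence)
outliers = [x for x in data if x < low_fence or x > high_fence]

print(q1, q2, q3)      # the box: -0.25 2.5 6.5
print(outliers)        # [25] - plotted past the end of the whisker
```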
The example below is part of the output from a t-test task in SAS Enterprise Guide. It is from the control group in our pilot study of Spirit Lake: The Game. The value plotted is the difference between post-test and pretest. So …. you can see that the mean difference between pre- and post-test for the control group was close to zero. The median was a little bit above zero. There are no really extreme outliers, and the distribution is a little skewed to the left, with the mean to the left of the median. The most extreme difference for the control group was an increase from pretest to post-test of 11 points. We can also see that zero falls squarely in the middle of our 95% confidence interval, so we fail to reject the null hypothesis that no significant increase in performance on the math test occurred for the control group. This isn’t really unexpected – you wouldn’t really anticipate large improvements in mathematics performance over only eight weeks.
Let’s take a look at another box and whisker plot, this time for our experimental group in the same study.
We can see right away that the whole distribution has shifted to the right, and this time it is skewed to the right. The median looks to be about four points higher on the post-test, and the mean is above that. The 25th percentile is at zero; in other words, 75% of the students showed some improvement from pretest to post-test. The 75th percentile is a nine-point improvement for the experimental group, versus three or four points for the control group. It can also be seen that zero is not within the 95% confidence interval, not even particularly close, so we reject the null hypothesis that there was no improvement for the experimental group.
If we line the plots underneath each other, with zero at the same point, it is particularly easy to see that the improvement in scores from pretest to post-test for the group who played the game was noticeably higher than for the control group.
So, there you have it – a couple of brief looks at the data improve your understanding of the results.
What is item difficulty analysis and how is it helpful?
Item difficulty analysis is simply examining what percentage of students answered each item correctly. It is one basic way to establish test validity. One would expect that items at the second-grade level would have the lowest difficulty, being answered correctly by the largest percentage of our students, and, at the other end, that the items at the fifth-grade level would have the highest difficulty and be answered correctly by the fewest students. Since the items are scored 0 = wrong, 1 = right, we can use the means to see what percentage of students answered correctly. A summary table can give you a nicely formatted table for a report, but here we’re just exploring our data, so using the univariate statistics you already have as a result of a CHARACTERIZE DATA task in SAS Enterprise Guide is easier.
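Because each item is scored 0 or 1, the difficulty really is just the column mean. A toy example in Python, with invented data, before walking through the point-and-click steps:

```python
# Item difficulty = mean of a 0/1-scored item. Invented data:
# rows are students, columns are items, 1 = answered correctly.
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
]

n_items = len(scores[0])
difficulty = [sum(row[i] for row in scores) / len(scores) for i in range(n_items)]
print(difficulty)  # [1.0, 0.75, 0.5, 0.25] - item 1 easiest, item 4 hardest
```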
1. Click on the univariate statistics data set produced by the CHARACTERIZE DATA TASK to select it,
2. From the top menu, select TASKS > DESCRIBE > LIST
3. From the variables to assign pane, select the ones you want in your report, in this case Variable, N, NMISS, Mean, Min and Max.
4. Select the records you want in your report. (If you want all of them, you can skip this step.) Now this part is a bit confusing because there is a variable named “variable”. Your univariate statistics data set has a column named “variable” and in it is the name of each variable for which you will be listing the N, NMISS, mean, etc. I only want the scored variables in my analysis, where items were scored 0 for incorrect and 1 for correct.
Click on EDIT from the button you can’t see in the screen shot above because I cut it off, but there really is an edit button, I promise. From the first drop-down menu, select Variable; from the next, select Not In A List; then click on the three dots to bring up a new window. In that window, click on the bottom left where it says Add Values. I selected q1 – q24, gender, missdata, age, pretotal, posttotal and usernum to drop. Click OK.
5. Format the columns on your report. This part is also optional but I personally find it easier to scan through reports without six decimal places in every mean. So, I change the format by right-clicking on Mean and selecting Properties. I click the CHANGE button next to format.
Then I click on Numeric for the format category, and scroll down to w.d. Under attributes, I put 8 for width and 2 for the number of decimal places. Then click OK.
6. Next, just to make the report even easier to read, I click Options and un-check the box next to Row Numbers.
Click RUN to run the task.
You don’t need to always export your output files to use them in some other program. I needed an xls file for an example, so at this point, I selected all of these data from the output open in Enterprise Guide and copied, and then pasted them into an OpenOffice Calc file (Excel would work just as well).
I sorted them in descending order and here is a partial picture of the result. I also changed the name from “variable” to “item” to make it less confusing.
It’s clear that the post-test and pre-test do not have the same number of people, so I need to be cautious about comparing them directly. However, within-test comparisons are fine. The test items are in order of grade level, beginning with second-grade level through fifth-grade. The first few items should be answered correctly by the most people. We can see that is true both for the post-test and pre-test, although it’s not perfect. Three items at the second-grade level were answered by over 80% of the students who took the post-test. We can also see that, generally, a higher percentage of students answered the post-test questions correctly than the pretest questions, as we would hope.
If you could scroll down to the bottom, you’d find that items 5 and 6 have some of the lowest percentage correct of any item, so I make another note to examine those items in more detail.
Ever since my daughters referred to the day in April when we all take our kids to the office as “Bore your daughters at work day”, it has been clear to me that others do not find me as interesting as I find myself.
Based on this profound knowledge, when I was asked to talk for an hour about categorical data analysis, I decided my best plan was to not try to say everything I know about this subject. Hence, I narrowed it down to Part 1 and Part 2, with Part 3 if I have time.
Part 1: Explaining a research project in six pictures.
Part 2: Five lesser-known options
- Fisher’s Exact Test. How to get one. Why you want one.
- The difference between the Pearson, Likelihood ratio and Mantel-Haenszel Chi-square
- When NOT to compare chi-square values directly
- Tests of binomial proportions
- Summary tables for multiple variables
Part 3: Three clues that your logistic regression model sucks
- Model fit statistics
- Tests of the global null hypothesis
If you are madly excited to come hear me talk about this, you can come to San Diego on Wednesday, August 21st for the San Diego SAS Users Group or to Oregon Day on October 10th when I’ll be at the Oregon SAS Users Group meeting in Portland.
Or, if you are thinking to yourself,
Ha! I would like to hear about categorical data analysis from AnnMaria for three hours, and no children with mouths full of red jellybeans are going to dissuade me!
Well, then, you are in luck, my friend, because I will be giving a class in Las Vegas, November 12th at the start of the Western Users of SAS Software conference - a class I still cannot believe begins at 8:30 am. I think I said I would do Wednesday because I thought it was in the afternoon.
The only other thing that has gotten me up before 10 am is The Spoiled One’s soccer games – and I love her. So, this will be a rare sighting. I will be the lady with the cup of coffee bigger than my head.
This is the third and last part of my attempt to explain logistic regression in pictures. You can see a picture of odds ratios here, and a picture of two charts of predicted probabilities, to compare models, here.
If people only know one chart associated with logistic regression, it is usually the ROC chart, though many of them cannot tell you what ROC stands for (not that it really matters) or how to interpret the chart – which kind of does matter, because it’s useful.
ROC is an abbreviation for receiver operating characteristic (I told you it didn’t matter). The ROC curve is a plot of
SENSITIVITY – the true positive rate: of the people who actually died, the percentage we correctly predicted would die, and
SPECIFICITY – the true negative rate: of the people who did NOT die, the percentage we correctly said would not.
We actually plot (1 – specificity) by sensitivity. If we predicted no one would die, our rate of true negatives would be 100%. Since we predicted nobody would die, we would be exactly right for all of the people who didn’t die. 1 – 1.0 = 0 so we’d be at 0 on the X axis.
On the other hand, we’d have zero sensitivity. Since we predicted no one would die, we would have zero true positives.
At the other extreme, if we predicted everyone would die, we would have 100% true positives and 0 true negatives. Since 1-0 = 1 , that would be at the upper right corner here.
The straight line is what we would get without any predictor variables, if we just randomly guessed whether a person would live or die. The top left corner, where we have correctly predicted all of our positives and all of our negatives is what we would get in a perfect model.
The more that curve is bowed toward the top left and away from the straight line, the better our model.
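To see how the points on the curve arise, here is a minimal sketch in Python with invented predicted probabilities and outcomes; each classification cutoff yields one (1 – specificity, sensitivity) point, including the two extremes described above:

```python
# Minimal ROC sketch with invented data: predicted probabilities of dying
# and actual outcomes (1 = died). Each threshold gives one curve point.
probs    = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
outcomes = [1,   1,   0,   1,   1,    0,   0,   1,   0,   0]

def roc_point(threshold):
    """Return (1 - specificity, sensitivity) at this cutoff."""
    pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(outcomes, pred) if y == 1 and yhat == 1)
    tn = sum(1 for y, yhat in zip(outcomes, pred) if y == 0 and yhat == 0)
    sensitivity = tp / sum(outcomes)                     # true positive rate
    specificity = tn / (len(outcomes) - sum(outcomes))   # true negative rate
    return (1 - specificity, sensitivity)

print(roc_point(1.01))  # predict nobody dies   -> (0.0, 0.0), bottom left
print(roc_point(0.0))   # predict everyone dies -> (1.0, 1.0), top right
print(roc_point(0.5))   # one point in between
```

Sweeping the threshold from high to low traces out the whole curve; the more those in-between points bow toward the top left, the better the model.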
Let’s take a look at our actual curve from the Kaiser-Permanente data, where we used gender, age, number of emergency room visits and nursing home residence (yes or no) to predict whether or not a person would die within the next nine years.
From this, we can conclude that our model is substantially better than random guessing – a conclusion that is consistent with what we saw in our previous charts. We can also see that there is definitely room for improvement. Perhaps future research could improve prediction by including behavioral risk indicators such as amount of alcohol and tobacco use, as well as socioeconomic status and diagnosis of chronic illness.
So, there you have it - logistic regression in three blog posts and four pictures.