
There is no substitute for real data

The second time I taught statistics, I supplemented the textbook with assignments using real data, and I have been doing it in the twenty-eight years since. The benefits seem so obvious to me that it’s hard to believe that everyone doesn’t do the same. The only explanation I can imagine is that they are not very good instructors or not very confident. You see, the problem with real data is you cannot predict exactly what the problems will be or what you will learn.

For example, the data I was planning on using for an upcoming class came from eight tables in two different MySQL databases. Four datasets had been read into SAS in the prior year’s analysis, and now four new files, exported as CSV files, were going to be read in.

Easy enough, right? This requires some SET statements and a PROC IMPORT, a MERGE statement and we’re good to go. What could go wrong?

Any time you find yourself asking that question, you should do the mad scientist laugh, like this – moo wha ha ha.

Here are some things that went wrong –

The PROC IMPORT did not work for some of the datasets. No problem, I replaced that with an INFILE statement and INPUT statement. It’s all good. They learned about FILENAME and file references and how to code an INPUT statement. Of course, being actual data, not all of the variables had the same length or type in every data set, so they learned about an ATTRIB statement to set attributes.
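For anyone who has not written one of these by hand, here is a minimal sketch of the pattern. The file path, data set name, variable names and lengths are all made up for illustration; they are not the ones from the class data.

```sas
/* Hypothetical path and variable names, for illustration only */
FILENAME pre2 "C:\class\pretest2.csv" ;

DATA pretest2 ;
  /* ATTRIB comes before INPUT so that length and type are set up front
     and can be made to match the same variables in the other data sets */
  ATTRIB username LENGTH = $32
         grade    LENGTH = $8
         score    LENGTH = 8 ;
  /* DSD reads comma-delimited values and treats two commas in a row as a
     missing value; FIRSTOBS = 2 skips the header row; TRUNCOVER keeps
     SAS from jumping to the next line when a record is short */
  INFILE pre2 DSD FIRSTOBS = 2 TRUNCOVER ;
  INPUT username $ grade $ score ;
RUN ;
```

Unlike PROC IMPORT, which guesses at types and lengths from the file, this spells them out, so the same ATTRIB statement can be reused for every file that is supposed to have the same layout.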

One data set just would not read in because it had some special characters in it, like an obelus (which is the name for the divide symbol, ÷; now you know). Thanks to Bob Hull and Robert Howard’s PharmaSUG paper, I found the answer.

DATA sl_pre ;

SET mydata.pretest (ENCODING='ASCIIANY');

RUN ;

Every data set had some of the same problems – usernames with data entry errors that were then counted as another user, data from testers mixed in with the subjects. The logical solution was a %INCLUDE of the code to fix this.
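Here is a rough sketch of how that can look. The file name, path, usernames and the mydata.posttest data set are hypothetical; the real fixes depend on what turns up in the data. The included file holds only the statements that do the cleaning:

```sas
/* fix_users.sas (hypothetical file): statements only, no DATA or RUN,
   so it can be pulled into the middle of any DATA step */
username = LOWCASE(COMPRESS(username)) ;            /* remove blanks, standardize case */
IF username = 'jsmtih' THEN username = 'jsmith' ;   /* made-up data entry error */
IF username IN ('tester1', 'tester2') THEN DELETE ; /* made-up tester accounts */
```

Each DATA step that needs the fixes then pulls the file in:

```sas
FILENAME fixes "C:\class\fix_users.sas" ;  /* hypothetical path */

DATA pretest_clean ;
  SET mydata.pretest ;
  %INCLUDE fixes ;
RUN ;

DATA posttest_clean ;
  SET mydata.posttest ;   /* hypothetical second data set */
  %INCLUDE fixes ;
RUN ;
```

The point of doing it this way is that when a new bad username shows up, it gets fixed in one file instead of in every program.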

In some data sets the grade variable was numeric and in others it was ‘numeric-ish’. I’m copyrighting that term, by the way. We’ve all seen numeric-ish data. Grade is supposed to be a number and in 95% of the cases it is, but in the other 5% they entered something like 3rd or 5th. The solution is here:

nugrade=compress(upcase(grade),'ABCDEFGHIJKLMNOPQRSTUVWXYZ ') + 0 ;

and then here

Data allstudentsents ;

set test1 ( rename =(nugrade= grade)) test2  ;

run ;

This gives me an opportunity to discuss two functions – COMPRESS and UPCASE, along with data set options in the SET statement.
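To see what the expression does, it can be run on a few made-up values. This is only a sketch for experimenting: the three data lines are invented, and nugrade2 is an alternative I am adding for comparison, not the code used in class. It uses the COMPRESS modifiers k (keep) and d (digits) and converts with the INPUT function.

```sas
DATA _NULL_ ;
  LENGTH grade $ 8 ;
  INPUT grade $ ;
  /* Same expression as above: uppercase, strip letters and blanks,
     then add 0 to force a character-to-numeric conversion */
  nugrade = COMPRESS(UPCASE(grade), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ ') + 0 ;
  /* Alternative: keep only the digits, then convert explicitly */
  nugrade2 = INPUT(COMPRESS(grade, , 'kd'), 8.) ;
  PUT grade= nugrade= nugrade2= ;
  DATALINES ;
3rd
5th
4
;
RUN ;
```

The + 0 version writes a note to the log that character values have been converted to numeric values; the INPUT version does the conversion explicitly, so that note goes away. One thing to check with the RENAME= option: if the data set still carries the original character grade variable, it would need a DROP= as well, since two variables cannot share a name.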

Kudos to Murphy for a cool paper on the COMPRESS function.

I do start every class with back-of-the-book data because it is an easy introduction and since many students are anxious about statistics, it’s good to start with something simple where everyone can succeed. By the second week, though, we are into real life.

Not everyone teaches with real data because, I think, there are too many adjunct faculty members who get assigned a course the week before it starts and don’t have time to prepare. (I simply won’t teach a course on short notice.) There are too many faculty members who are teaching courses they don’t know well and reading the chapter a week ahead of the students.

Teaching with real, messy data isn’t easy, quick or predictable – which makes it perfect for showing students how statistics and programming really work.

I’m giving a paper on this at WUSS 14 in San Jose in September. If you haven’t registered for the conference, it’s not too late. I’ll post the code examples here this week so if you don’t go you can be depressed about what you are missing.

2 Comments

  1. Good point. I think part of the challenge is the question of teaching statistics vs. programming or both. Often folks teaching stat want to teach whatever statistical methods, and feel like they don’t have time to teach programming. But it’s a tremendous disservice to students. Many recent MS Stat grads don’t know how to handle real data, and they may know many SAS PROCs but not the DATA step. In my graduate program the stat courses were coordinated with a separate analytic programming course that taught the programming. Worked well.

  2. Quentin –
    I completely agree with you. In my graduate program, all stat classes had a 3-hour lecture and a 3-hour lab where we learned to program using SAS. When I started teaching, the labs had been done away with and somehow we were supposed to teach the same amount in 3 hours that we used to teach in 6, with a T.A., “because the students are working full time” – which meant they learned less. I started teaching programming in my classes when I realized that I wouldn’t hire the graduates from programs where I taught because they couldn’t work with real data.
