When I was running out to the airport, I said I would explain how to get a single plot of all of your scatter plots with SAS. Let’s see if I can get this posted before the plane lands. Not likely but you never know …

It’s super simple:

proc sgscatter data=sashelp.heart ;
  matrix ageatdeath ageatstart agechddiag mrw / group=sex ;
run ;

And there you go.

[Image: scatter plot matrix of the heart data]

 

Statistical graphics from 10,000 feet. Is this a full service blog or what?

My life is upside down. All day, as my job, I wrote a program to get a little man to run around a maze, come out the other end, and bring up a new screen with a math challenge question. Then, in the evening, I’m surfing the web for interesting bits to read on multivariate statistics.

I’m teaching a course this winter and could not find the Goldilocks textbook, you know, the one that is just right. They either had no details – just pointing and clicking to get results – or the wrong kind of details. One book had pages of equations, then code for producing some output with very little explanation and no interpretation of the output.

I finally did select a textbook but it was a little less thorough than I would like in some places. I decided to supplement it with required and optional readings from other sources. Thus, the websurfing.

One book I came across that is a good introduction for the beginning of a course is Applied Multivariate Statistical Analysis, by Hardle and Simar. You can buy the latest version for $61 but really, the introduction from 2003 is still applicable. I was delighted to see someone else start with the same point as I do – descriptive statistics.

Whether you have a multivariate question or univariate one, you should still begin with understanding your data. I really liked the way they used plots to visualize multiple variables. I knocked a few of these out in a minute using SAS Studio.

 
symbol1 v=squarefilled ;
proc gplot data=sashelp.heart ;
  plot ageatdeath*agechddiag = sex ;
  plot ageatdeath*ageatstart = sex ;
  plot ageatdeath*mrw = sex ;
run ;
quit ;

title "Male" ;
proc gchart data=sashelp.heart ;
  vbar mrw / midpoints = (70 to 270 by 10) ;
  where sex = "Male" ;
run ;

title "Female" ;
proc gchart data=sashelp.heart ;
  vbar mrw / midpoints = (70 to 270 by 10) ;
  where sex = "Female" ;
run ;

If I had more time, I would show all of these plots on one page – a draftsman’s plot – but I’m running out to the airport in a minute. Maybe next time. Yes, I do realize these charts are probably terrible for people who are color-blind. Will work on that at some point also.

You can see that age at diagnosis and age at death are linearly related. It seems there are many more males than females and the age at death seems higher for females.

You can see a larger picture here.

[Images: age at diagnosis by age at death; age at start by age at death]
The picture with Metropolitan relative weight did not seem nearly as linear, which makes sense because, if you think about it, age at start, age at diagnosis and age at death HAVE to be related. You cannot be diagnosed at age 50 if you died when you were 30. It also seems as if there is more variance for women than men and the distribution is skewed positively for women.
MRW by age at death by gender

The last two graphs seem to bear that out (you can see those here – click and scroll down), which makes me want to do a PROC UNIVARIATE and a t-test. It also makes me wonder if it’s possible that weight could be related to age at death for men but not for women. Or, it could just be that as you get older, you put on weight and older people are more likely to die.

My point is that some simple graphs can allow you to look at your variables in 3 dimensions and then compare the relationships of multiple 3-dimensional graphs.

Gotta run – taking a group of middle school students to a judo tournament in Kansas City. Life is never boring.

I’m pretty certain I did not deliberately hide these folders. When I opened up my new and improved SAS Studio, it had tasks but my programs were missing.

[Image: SAS Studio tasks pane]

If this happens to you and you are full of sadness missing your programs, look to the top right of your screen where you see some horizontal lines. Click on those lines.

 

[Image: folders with programs]

A menu will drop down that has the VIEW option. Select that and select FOLDERS. Now you can view the folders where your programs reside. Based on this, you might think I’m against using the tasks. You’d be wrong. I just like having the option to write SAS code if needed. The tasks are super easy to use and students are going to love this. Check out my multiple regression task for an example.

 

[Image: linear regression task results]

 

I selected the data set from the SASHELP library, then the dependent and independent variables, and there you are – ANOVA table, parameter estimates, plot of observed vs. predicted values of the dependent variable and, if you scrolled down – not here, because this is just a screen shot, but in SAS Studio – all of your diagnostic plots. Awesome sauce.
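For anyone who would rather type than point and click, here is a rough sketch of the equivalent code. The variable choices are my guess, not necessarily what is in the screen shot above.

/* Hypothetical model - swap in whatever dependent and independent
   variables you picked in the task */
proc reg data=sashelp.heart plots=diagnostics ;
  model ageatdeath = ageatstart mrw ;
run ;
quit ;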


I’ve been pretty pleased with SAS Studio (the product formerly known as SAS Web Editor), so when Jodi sent me an email with information about using a virtual machine for the multivariate statistics course, I was a bit skeptical. Every time I’ve had to use a remote desktop connection to a virtual machine running SAS it has been painfully slow. I’ve done it several times, but that was probably back in 2001, 2003 and 2008, when I was at sites that tried, and generally failed, to use SAS on virtual machines.

Your mileage may vary and here is the danger of testing on a development machine – I have the second-best computer in the office. I have 16GB of RAM and a 3.5 GHz Intel Core i7 processor. Everything from available space (175 GB) to download speed (27Mbps) is probably better than the average student will have.

On the previous occasions when I used SAS on a remote virtual machine I had pretty good computers, too, for the time, but 6-13 years is a pretty dramatic difference in terms of technology.

That being said, the virtual machine offered levels of coolness not available with SAS Studio.

Firstly, size. I did a factor analysis with 206 variables and 51,000 observations because I’m weird like that. I wanted to see what would happen. It extracted 49 factors and performed a varimax rotation in 16.49 seconds. I don’t believe SAS Studio was created with this size of data set in mind.
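For the curious, the run was nothing fancier than something along these lines. The data set and variable names here are hypothetical stand-ins for my real 206 variables.

/* Principal factor extraction with a varimax rotation; names are placeholders */
proc factor data=mydata.bigsurvey rotate=varimax ;
  var item1-item206 ;
run ;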

Secondly, size again. The data sets on the virtual machine added up to several times more than the allowable space for  a course directory in SAS on-demand.
Thirdly, it looked exactly like SAS because it was.

Now, I do realize that the virtual machine with SAS is probably only allowable if your university has a site wide license from SAS.

SAS Studio retains the significant advantage of being free and easy. It also seems to have morphed overnight. I don’t remember these tasks being on the left side, and while they look interesting and useful, they do NOT

  1. Encompass all of the statistics students need to compute in my classes, e.g. , population attributable risk.
  2. Explain where the heck my programs went that I wrote previously. I can still create a new program and save a program and it even shows the folders I had previously as choices to save the new program.

#1 is easily taken care of if I can just find out where the programs are saved; for statistics not available in the task selections, students can just write a program. I’ll look into that this weekend, since I have had to get up before 9 a.m. THREE days this week and I am thinking I need to get some sleep.

[Image: SAS Studio tasks pane]

From my initial takes of the latest versions of each, I think I will:

  • Use SAS Studio for my biostatistics course because it is an easy, basic introduction AND, once I figure out where the programs are hidden, I can have students write some simple programs. (It may be in an obvious place but sleep deprivation does strange things to your brain.)
  • Use the virtual machine for multivariate statistics because it allows for larger data sets and, although I did not have a similar size data set in SAS Studio, I am assuming it will run much faster.

 

The new common core standards have statistics first taught in the sixth grade, or so they say. I disagree with this statement because as I see it, much of the basis of statistics is taught in the earlier grades, although not called by that name.  Here are just a few examples:

  • Bar graphs
  • Line plots
  • X,Y coordinates
  • Fractions and decimals (since the mean is rarely going to be an integer)
  • Ratios and proportions – in summarizing a data set, it’s pretty common to point out, for example, that the ratio of game fish to non-game fish was 3:2. We are often asking if the percentage of something observed is disproportionate to the percentage in the population.

It doesn’t bother me that these topics are not called statistics; I’m just pointing it out. Whether a line is considered a regression line or simply points in two-dimensional space is a matter of context and nothing else.

Speaking of lines and graphs, the very basis of describing a distribution starts with graphing it. So, those second grade bar graphs? Precursors to sixth grade requirements to summarize and describe a set of data.

You might say I’m going to an extreme including fractions in there because I may as well throw in addition and division. After all, you need to add up the individual scores and divide by N. Actually, I wouldn’t argue too much with that view.

You can’t even compute a standard deviation without understanding the concepts of squares and square roots, so it would be easy to argue that is at least a prerequisite to statistics.
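(For reference, here is the usual sample standard deviation, with the squares and the square root sitting right there in the definition.)

\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 } \]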

While I’ve heard a lot of people hating on the common core, personally, I’m interested in seeing how it plays out.

What I expect will continue to happen is that many children will be turned off of math by the third grade because it is generally taught SO abysmally. That isn’t all the fault of teachers – the books they are given to use are often deathly boring. This isn’t to say I am not bothered by the situation. It bothers me a lot.

Working mostly in low-performing schools, I see students who are not very proficient with fractions, proportions, exponents or mathematical notation. We are trying to design our games  to teach all of those prerequisites and then start showing students different distributions, having them collect and interpret data.

Lacking prerequisites is one of the three biggest barriers I see in teaching statistics, or any math, to students. The other two are related; low expectations for what students should be able to learn at each grade, and the fiction on the part of teachers and students that everything should be easy.

People were all up in arms years ago because there was a Barbie doll that said, “Math is hard.”

Guess what? Math is hard sometimes and that is why you have to work hard at it. Even if you really like math and do well at it in school, even if it’s your profession, there are times when you have to spend hours studying it and figuring something out.

Today, I was reviewing textbooks for a course I’ll be teaching on multivariate statistics. I didn’t like any of the three I read for the course, although I found one of them pretty interesting just from a personal perspective. The one I liked had page after page of equations in matrix algebra and it would be a definite stretch for most master’s students. I’m really debating using it because I know, just like with the middle school students, there will be many lacking prerequisites and it will take a LOT of work on my part to explain vectors and determinants before we can even get to what they are supposed to be learning.

Last week, I had someone seriously ask me if we could make our games “look less like math so that students are learning it without realizing it”. No, we cannot. There’s nothing wrong with learning math that we should need to disguise it to look like something else.

Whenever I catch myself thinking in designing a game, “Will the students be able to do X?” and I think they will not because they are lacking the prerequisites, I build an earlier level to teach the prerequisites and go ahead and include X anyway.

Here is why — I’m sitting at the other end teaching graduate students where the text begins like this:

root mean square residual (RMR) For a single-group analysis, the RMR is the root of the mean squared residuals:

\[ \mbox{RMR} = \sqrt{\frac{1}{b} \left[ \sum_i^p \sum_j^i (s_{ij} - \hat{\sigma}_{ij})^2 + \delta \sum_i^p (\bar{x}_i - \hat{\mu}_i)^2 \right]} \]

where

\[ b = \frac{p(p+1+2\delta)}{2} \]

is the number of distinct elements in the covariance matrix and in the mean vector (if modeled).

For multiple-group analysis, PROC CALIS uses the following formula for the overall RMR:

\[ \mbox{overall RMR} = \sqrt{\sum_{r=1}^k \frac{w_r}{\sum_{r=1}^k w_r} \left[ \sum_i^p \sum_j^i (s_{ij} - \hat{\sigma}_{ij})^2 + \delta \sum_i^p (\bar{x}_i - \hat{\mu}_i)^2 \right]} \]
Okay, actually I just pulled that from the SAS PROC CALIS documentation because I was too lazy to copy all of the equations that were in the book I was reading, which went on for pages and pages in this vein, but you get the idea.

Now, these 6th graders are 11 years from being in my course. What I want to know is in what grade do we magically go from having it “not look like math” to reading sentences like,

“The probability distribution (density) of a vector y denoted by f(y) is the same as the joint probability distribution of y1 … yp.”

or

“It is easy to verify that the correlation coefficient matrix, R, is a symmetric positive definite matrix in which all of the diagonal elements are unity.”

If those two sentences don’t make absolute perfect sense to you, well, you’re fucked, because those were the two easiest sentences in the chapter; I just pulled out the ones that did not require me to type in a lot of matrices. I’m teaching at Point B and I want to know how, if students’ Point A is not doing anything that looks like math, they are ever going to get here.

I think the answer is pretty obvious, which is why I’m insisting on teaching every bit of math every chance I can get.

Sometimes the benefits of attending a conference aren’t so much the specific sessions you attend as the ideas they spark. One example was at the Western Users of SAS Software conference last week. I was sitting in a session on PROC PHREG and the presenter was talking about analyzing the covariance matrix when it hit me –

Earlier in the week, Rebecca Ottesen (from Cal Poly) and I had been discussing the limitations of directory size with SAS Studio. You can only have 2 GB of data in a course directory. Well, that’s not very big data, now, is it?

It’s a very reasonable limit for SAS to impose. They can’t go around hosting terabytes of data for each course.

If you, the professor, have a regular SAS license, which many professors do, you can create a covariance matrix for your students to analyze. Even if you include 500 variables, that’s going to be a pretty tiny dataset but it has the data you would need for a lot of analyses – factor analysis, structural equation models, regression.

Creating a covariance data set is a piece of cake. Just do this:

proc corr data=sashelp.heart cov outp=mydata.test2 ;
  var ageatdeath ageatstart ageCHDdiag ;
run ;

The COV option requests the covariances and the OUTP option has those written to a SAS data set.

If you don’t have access to a high performance computer and have to run the analysis on your desktop, you are going to be somewhat limited, but far less than just using SAS Studio.

So — create a covariance matrix and have them analyze that. Pretty obvious and I don’t know why I haven’t been doing it all along.
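Because the OUTP= data set from PROC CORR with the COV option is TYPE=COV, procedures such as PROC REG and PROC FACTOR will read it directly, as if you had handed them the raw data (minus the observation-level diagnostics, of course). A quick sketch, with a model chosen purely for illustration:

/* Regression estimated from the covariance matrix created above */
proc reg data=mydata.test2 ;
  model ageatdeath = ageatstart agechddiag ;
run ;
quit ;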

What about means, frequencies and chi-square and all that, though?

Well, really, the output from a PROC FREQ can condense your data down dramatically. Say I have 10,000,000 people and I want age at death, blood pressure status, cholesterol status, cause of death and smoking status. I can create an output data set like this. (Not that the heart data set has 10,000,000 records but you get the idea.)

proc freq data=sashelp.heart ;
  tables AgeAtDeath
         *BP_Status
         *Chol_Status
         *DeathCause
         *Sex
         *Smoking / noprint out=mydata.test1 ;
run ;

This creates a data set with a count variable, which you can use in your WEIGHT statement in just about any procedure, like

proc means data=mydata.test1 ;
  weight count ;
  var ageatdeath ;
run ;

 

Really, you can create “cubes” and analyze your big data on SAS Studio that way.
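For example, here is a chi-square on the condensed data set. The WEIGHT statement tells PROC FREQ that each row stands for COUNT people, so you get the same table you would have gotten from the full file.

/* Chi-square from the condensed "cube" rather than the raw data */
proc freq data=mydata.test1 ;
  tables bp_status*smoking / chisq ;
  weight count ;
run ;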

Yeah, obvious, I know, but I hadn’t been doing it with my students.

The second time I taught statistics, I supplemented the textbook with assignments using real data, and I have been doing it in the twenty-eight years since. The benefits seem so obvious to me that it’s hard to believe that everyone doesn’t do the same. The only explanation I can imagine is that they are not very good instructors or not very confident. You see, the problem with real data is you cannot predict exactly what the problems will be or what you will learn.

For example, the data I was planning on using for an upcoming class came from 8 tables in two different MySQL databases. Four data sets had been read into SAS in the prior year’s analysis, and now four new files, exported as CSV files, were going to be read in.

Easy enough, right? This requires some SET statements and a PROC IMPORT, a MERGE statement and we’re good to go. What could go wrong?

Any time you find yourself asking that question you should do the mad scientist laugh like this – moo wha ha ha .

Here are some things that went wrong -

The PROC IMPORT did not work for some of the datasets. No problem, I replaced that with an INFILE statement and INPUT statement. It’s all good. They learned about FILENAME and file references and how to code an INPUT statement. Of course, being actual data, not all of the variables had the same length or type in every data set, so they learned about an ATTRIB statement to set attributes.
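A stripped-down sketch of that kind of fix is below. The path, the delimiter and the variable names are all made up; the real files had many more variables.

/* Hypothetical CSV read with explicit attributes so type and length
   match across data sets */
filename posttest '/courses/mydata/posttest.csv' ;

data work.posttest ;
  attrib username length=$32
         grade    length=$8
         score    length=8 ;
  infile posttest dsd firstobs=2 missover ;
  input username $ grade $ score ;
run ;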

Reading in one data set just would not work; it has some special characters in it, like an obelus (which is the name for the divide symbol – ÷ – now you know). Thanks to Bob Hull and Robert Howard’s PharmaSUG paper, I found the answer.

DATA sl_pre ;
  SET mydata.pretest (ENCODING='ASCIIANY') ;
RUN ;

Every data set had some of the same problems – usernames with data entry errors that were then counted as another user, data from testers mixed in with the subjects. The logical solution was a %INCLUDE of the code to fix this.
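That looks something like this, with a made-up path:

/* Same cleaning code pulled into every program; the path is hypothetical */
%include '/home/annmaria/cleaning/fix_usernames.sas' ;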

In some data sets the grade variable was numeric and in others it was ‘numeric-ish’. I’m copyrighting that term, by the way. We’ve all seen numeric-ish data. Grade is supposed to be a number, and in 95% of the cases it is, but in the other 5% they entered something like 3rd or 5th. The solution is here:

nugrade=compress(upcase(grade),'ABCDEFGHIJKLMNOPQRSTUVWXYZ ') + 0 ;

and then here

Data allstudents ;
  set test1 (rename=(nugrade=grade)) test2 ;
run ;

This gives me an opportunity to discuss two functions – COMPRESS and UPCASE, along with data set options in the SET statement.

Kudos to Murphy for a cool paper on the COMPRESS function.

I do start every class with back-of-the-book data because it is an easy introduction and since many students are anxious about statistics, it’s good to start with something simple where everyone can succeed. By the second week, though, we are into real life.

Not everyone teaches with real data because, I think, there are too many adjunct faculty members who get assigned a course the week before it starts and don’t have time to prepare. (I simply won’t teach a course on short notice.) There are too many faculty members who are teaching courses they don’t know well and reading the chapter a week ahead of the students.

Teaching with real, messy data isn’t easy, quick or predictable – which makes it perfect for showing students how statistics and programming really work.

I’m giving a paper on this at WUSS 14 in San Jose in September. If you haven’t registered for the conference, it’s not too late. I’ll post the code examples here this week, so if you don’t go you can be depressed about what you are missing.


Visual literacy, I have decided (being the word chooser of this blog), means the ability to “read” graphic information. A post I saw today on Facebook earnings over time gave a prime example of this.

 

Chart of Facebook earnings by region

If you are a fluent “visualizer”, then just like a fluent reader can read a paragraph and comprehend it, summarize the main points and rephrase it, you could easily grasp the chart above. You would say:

  • Over two years, the number of users from the U.S.  & Canada has grown relatively little.
  • The U.S./Canadian market had the lowest number of users for the past two years.
  • Europe was the next smallest market and grew about 20% over two years.
  • Asia was the second-largest “market”, second only to “the rest of the world”.
  • The U.S./Canada and European markets are shrinking as a percentage of Facebook users.

My point isn’t anything about Facebook or Facebook users. I don’t really care. What I do want to point out is that if you are reading this blog, you probably found all of those points so obvious that you wonder why I am even mentioning them. Of course, you are reading this blog, so no one needs to explain what those black letters on the screen mean, either.

My point, and I do have one, is that somehow, somewhere, you learned to read graphs like that and that is an important skill. Most likely, you are fluent. That is, many people could perhaps puzzle out what that graph means, just like many people who are not proficient readers can sound out words and kind of figure out the meaning of a paragraph or two. Those people do not generally read War and Peace, or The Definitive Guide to Javascript.

The need for visual literacy is all around you – and that’s my real point.


When we started the Dakota Learning Project to evaluate our educational games, I wondered if we had bitten off more than we could chew. We proposed to develop the games, pilot them in schools, collect data and analyze the data to see if the games had any impact. We were also going to go back and revise the games based on feedback from the students and teachers.

Some people told us this was far too much and we should just do a qualitative study observing the students playing the game and having them “think aloud”. Another competition we applied to for funding turned us down and one of the reasons they gave is that we were proposing too much.

We ended up doing a mixed methods design, collecting both qualitative and quantitative data and I’m very glad I did not listen to any of these people telling me that it was too much.

There is no substitute for statistics.

When I observed the students in the labs, I thought that perhaps the grade level assigned to specific problems was inconsistent with what the students could really do. For example:

Add and subtract within 1000 … is at the second-grade level 

Multiply one-digit numbers  … is at the third-grade level

It seemed to me that students were having a harder time with the supposedly second-grade problem, but I wasn’t sure if that was really true. Maybe I was seeing the same students miss it over and over. After all, we had 591 students play Spirit Lake in this round of beta testing. It was certainly possible I saw the same students more than once. It is definitely the case that students who were frustrated and just could not get a problem stuck in my mind.

So … I went back to the data. These data do double-duty because I’m teaching a statistics class this fall and I am a HUGE advocate of graduate students getting their hands on real data, and here was some actual real data to hand them. (I always analyze the data in advance so it is easy to grade the students’ papers, to give examples in class, and so I don’t get students complaining that I am trying to get them to do my work for me, although they still do. Ha! As if.)

We had 1,940 problems answered so, obviously, students answered more than one problem each. Of those problems, 1,053, or 54.3%, were answered correctly on the first attempt. This made me quite happy because it is close to an ideal item difficulty level. Too easy and students get bored. Too hard and they get frustrated.

I used SAS Enterprise guide to produce the chart below:

chart showing subtraction in the middle of difficulty range
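If you would rather have code than point and click, something like this would get you close, assuming a data set with one record per first attempt and a 0/1 variable for answering correctly. The data set and variable names are hypothetical.

/* Proportion answered correctly on the first attempt, by problem */
proc sgplot data=mydata.first_attempts ;
  vbar problem_id / response=correct stat=mean ;
  yaxis label="Proportion correct on first attempt" ;
run ;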

You can see that the subtraction problem showed up about mid-range in difficulty. Now, it should be noted that the group gets more selective as you move along. That is, you don’t get to the multiplication problems unless you passed the subtraction problem. Still, it is worth noting that only 70% of fourth- and fifth-grade students in our sample answered correctly on the first try a problem that was supposedly a second-grade question.

Because we want students to start the game succeeding, I added a simpler problem at the beginning. That’s the first bar with 100% of the students answering it correctly. I won’t get too excited about that yet, as I added it later in the study and only a few students were presented that problem. Still, it looks promising.

So, what did I learn that I couldn’t learn without statistics? Well, it reinforced my intuition that the subtraction problem was harder than the multiplication ones and told me that  a substantial proportion of students were failing it on the first try. It was not the same students failing over and over.

The second question then, was whether the instructional materials made any difference. I’m pleased to tell you that they did. On the second (or higher) attempt, 85% of the students answered correctly. If you add the .85 of the 30% who failed the first go-round to the 70% who passed on the first attempt, you get 92% of the students continuing on in the game. This made me happy because it shows that we are beginning at an appropriate level of difficulty. I would have liked 100% but you can’t have everything.

I should note that the questions are NOT multiple choice, and in fact, the answer to that particular problem is 599, so it is not likely the student would have just guessed it on the second attempt.


More notes from the text mining class. …

This is the article I mentioned in the last post, on Singular Value Decomposition

ftp://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf

Contrary to expectations, I did find time to read it on the ride back from Las Vegas, and it is surprisingly accessible even to people who don’t have a graduate degree in statistics, so I am going to include it in the optional reading for my course.

Many of these concepts, like start and stop lists, apply to any text mining software, but it just happens that the class I’m teaching this fall uses SAS.

———
In Enterprise Miner, you can only have one project open at a time, but you can have multiple diagrams and libraries, and of course, zillions of nodes, in a single project.

In Enterprise Miner, you can use text or text location as a type of variable. Documents under 32K in size can be contained in the project as a text variable. If greater than 32K, give a text location.

Dictionaries

  • start lists – often used for technical terms
  • stop lists, e.g. articles like “the”, pronouns. These appear with such frequency in documents that they don’t contribute to our goal, which is to distinguish between documents. A stop list may also include words that are high frequency in your particular data – for example, mathematics in our data, because it is in almost every document we are analyzing.

Synonym tables
Multi-word term tables – standard deviation is a multi-word term

Importing a dictionary — go to properties. Click the … (ellipsis) next to the dictionary (start or stop) you want to import. When the window comes up, click IMPORT.

Select the SAS library you want. Then select the data set you want. If you don’t find the library that you want, try this:

  1. Close your project.
  2. Open it again
  3. Click on the 3 dots next to PROJECT START CODE in the property window
  4. Write a LIBNAME statement that gives the directory where your dictionaries are located (see the sketch below).
  5. Open your project again

[Note:  Re-read that last part on start code. This applies to any time you aren't finding the library you are looking for, not just for dictionaries. You can also use start code for any SAS code you want to run at the start of a project. I can see people like myself, who are more familiar with SAS code than Enterprise Miner, using that a lot.]
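The start code itself is nothing fancy, just a LIBNAME statement along these lines (the path is made up):

/* Project start code: point a libref at the directory with the dictionaries */
libname textdict '/courses/textmining/dictionaries' ;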

Filter viewer – can specify minimum number of documents for term inclusion

 

—————-

[Photo: Jenn and Chris]

Speaking of Las Vegas, blogging has been a little slow lately since we took off to watch The Perfect Jennifer get married. It was a very small wedding, officiated by Hawaiian Elvis. Darling Daughter Number Three doubled as bartender and bridesmaid, then stayed in Las Vegas because she has a world title fight in a few days.

Given the time crunch, I was particularly glad I’d attended this course that gave me the opportunity to draft at least one week’s worth of lectures for the fall. When I finish these notes, my plan is to edit them and turn them into the last lecture in the data mining course. If it’s helpful to you, feel free to use whatever you like. I’ll try to remember to post a more final version in the fall. If you have teaching resources for data mining yourself, please let me know.

My crazy schedule is the reason I start everything FAR ahead of time.

 
