# Interestingness from WUSS: Part 2 Condensing Big Data

Sometimes the benefits of attending a conference aren’t so much the specific sessions you attend as the ideas they spark. One example was at the Western Users of SAS Software conference last week. I was sitting in a session on PROC PHREG and the presenter was talking about analyzing the covariance matrix when it hit me —

Earlier in the week, Rebecca Ottesen (from Cal Poly) and I had been discussing the limitations of directory size with SAS Studio. You can only have 2 GB of data in a course directory. Well, that’s not very big data, now, is it?

It’s a very reasonable limit for SAS to impose. They can’t go around hosting terabytes of data for each course.

If you, the professor, have a regular SAS license, which many professors do, you can create a covariance matrix for your students to analyze. Even if you include 500 variables, that’s going to be a pretty tiny dataset but it has the data you would need for a lot of analyses – factor analysis, structural equation models, regression.

Creating a covariance data set is a piece of cake. Just do this:

proc corr data=sashelp.heart cov outp=mydata.test2 ;

var ageatdeath ageatstart ageCHDdiag ;

The COV option requests the covariances and the OUTP option has those written to a SAS data set.

If you don’t have access to a high performance computer and have to run the analysis on your desktop, you are going to be somewhat limited, but far less than just using SAS Studio.

So — create a covariance matrix and have them analyze that. Pretty obvious and I don’t know why I haven’t been doing it all along.

What about means, frequencies and chi-square and all that, though?

Well, really, the output from a PROC FREQ can condense your data down dramatically. Say I have 10,000,000 people and I want age at death, blood pressure status, cholesterol status, cause of death and smoking status. I can create an output data set like this. (Not that the heart data set has 10,000,000 records but you get the idea.)

Proc freq data= sashelp.heart ;

Tables AgeAtDeath

*BP_Status

*Chol_Status

*DeathCause

*Sex

*Smoking /noprint out=mydata.test1;

This creates a data set with a count variable, which you can use in your WEIGHT statement in just about any procedure, like

proc means data = test1 ;

weight count ;

var ageatdeath ;

Really, you can create “cubes” and analyze your big data on SAS Studio that way.

Yeah, obvious, I know, but I hadn’t been doing it with my students.