When I was running out to the airport, I said I would explain how to get a single plot of all of your scatter plots with SAS. Let’s see if I can get this posted before the plane lands. Not likely but you never know …
It’s super simple
proc sgscatter data=sashelp.heart ;
matrix ageatdeath ageatstart agechddiag mrw / group= sex ;
And there you go.
Statistical graphics from 10,000 feet. Is this a full service blog or what?
My life is upside down. All day, as my job, I spent writing a program to get a little man to run around a maze, come out the other end and have a new screen come up with a math challenge question. Then, in the evening, I’m surfing the web for interesting bits to read on multivariate statistics.
I’m teaching a course this winter and could not find the Goldilocks textbook, you know, the one that is just right. They either had no details – just pointing and clicking to get results – or the wrong kind of details. One book had pages of equations, then code for producing some output with very little explanation and no interpretation of the output.
I finally did select a textbook but it was a little less thorough than I would like in some places. I decided to supplement it with required and optional readings from other sources. Thus, the websurfing.
One book I came across that is a good introduction for the beginning of a course is Applied Multivariate Statistical Analysis, by Hardle and Simar. You can buy the latest version for $61 but really, the introduction from 2003 is still applicable. I was delighted to see someone else start with the same point as I do – descriptive statistics.
Whether you have a multivariate question or univariate one, you should still begin with understanding your data. I really liked the way they used plots to visualize multiple variables. I knocked a few of these out in a minute using SAS Studio.
symbol1 v= squarefilled ;
proc gplot data=sashelp.heart ;
plot ageatdeath*agechddiag = sex ;
plot ageatdeath*ageatstart = sex ;
plot ageatdeath*mrw = sex ;
Title “Male ” ;
proc gchart data=sashelp.heart ;
vbar mrw / midpoints = (70 to 270 by 10) ;
where sex = “Male” ;
Title “Female ” ;
proc gchart data=sashelp.heart ;
vbar mrw / midpoints = (70 to 270 by 10);
where sex = “Female” ;
If I had more time, I would show all of these plots in one page – a draftman’s plot - but I’m running out to the airport in a minute. Maybe next time. Yes, I do realize these charts are probably terrible for people who are color-blind. Will work on that at some point also.
You can see that the age at diagnosis and death is linearly related. It seems there are many more males than females and the age at death seems higher for females.
The picture with Metropolitan relative weight did not seem nearly as linear, which makes sense because if you think about it, age at start and age at death HAVE to be related. You cannot be diagnosed at age 50 if you died when you were 30.It also seems as if there is more variance for women than men and the distribution is skewed positively for women.
The last two graphs seem to bear that out, ( You can see those here – click and scroll down). which makes me want to do a PROC UNIVARIATE and a t-test. It also makes me wonder if it’s possible that weight could be related to age at death for men but not for women. Or, it could just be that as you get older, you put on weight and older people are more likely to die.
My point is that some simple graphs can allow you to look at your variables in 3 dimensions and then compare the relationships of multiple 3-dimensional graphs.
Gotta run – taking a group of middle school students to a judo tournament in Kansas City. Life is never boring.
I’m pretty certain I did not deliberately hide these folders. When I opened up my new and improved SAS Studio, it had tasks but my programs were missing.
If this happens to you and you are full of sadness missing your programs, look to the top right of your screen where you see some horizontal lines. Click on those lines.
A menu will drop down that has the VIEW option. Select that and select FOLDERS. Now you can view your folders where your programs reside. Based on this, you might think I’m against using the tasks. You’d be wrong. I just like having the option to write SAS code if needed. The tasks are super easy to use and students are going to love this. Check my multiple regression for an example
I selected the data set from the SASHELP library, the dependent and independent variables and there you are – ANOVA table, parameter estimates, plot of dependent observed vs predicted and if you scrolled down – not here because this is just a screen shot, but in SAS Studio, you’d see all of your diagnostic plots. Awesome sauce.
I’ve been pretty pleased with SAS Studio (the product formerly known as SAS Web Editor), so when Jodi sent me an email with information about using a virtual machine for the multivariate statistics course, I was a bit skeptical. Every time I’ve had to use a remote desktop connection virtual machine for SAS it has been painfully slow. I’ve done it several times but it’s probably been like in 2001, 2003 and 2008 when I was at sites that tried, and generally failed, to use SAS on virtual machines.
Your mileage may vary and here is the danger of testing on a development machine – I have the second-best computer in the office. I have 16GB of RAM and a 3.5 GHz Intel Core i7 processor. Everything from available space (175 GB) to download speed (27Mbps) is probably better than the average student will have.
The previous occasions I was using SAS on a remote virtual machine I had pretty good computers, too, for the time, but 6 -13 years is pretty dramatic differences in terms of technology.
That being said, the virtual machine offered levels of coolness not available with SAS Studio.
Firstly, size. I did a factor analysis with 206 variables and 51,000 observations because I’m weird like that. I wanted to see what would happen. It extracted 49 factors and performed a varimax rotation in 16.49 seconds. I don’t believe SAS Studio was created with this size of data set in mind.
Secondly, size again. The data sets on the virtual machine added up to several times more than the allowable space for a course directory in SAS on-demand.
Thirdly, it looked exactly like SAS because it was.
Now, I do realize that the virtual machine with SAS is probably only allowable if your university has a site wide license from SAS.
SAS Studio remains as having the significant advantage of being free and easy. It also seems to have morphed overnight. I don’t remember these tasks being on the left side, and while they look interesting and useful, they do NOT
- Encompass all of the statistics students need to compute in my classes, e.g. , population attributable risk.
- Explain where the heck my programs went that I wrote previously. I can still create a new program and save a program and it even shows the folders I had previously as choices to save the new program.
#1 is easily taken care of if I can just find out where the programs are saved, for statistics not available in the task selections, they can just write a program. I’ll look into that this weekend since I have had to get up THREE days this week before 9 a.m. I am thinking I need to get some sleep.
From my initial takes of the latest versions of each, I think I will:
- Use SAS Studio for my biostatistics course because it is an easy, basic introduction AND, once I figure out where the programs are hidden, I can have students write some simple programs. (It may be in an obvious place but sleep deprivation does strange things to your brain.)
- Use the virtual machine for multivariate statistics because it allows for larger data sets and, although I did not have a similar size data set in SAS Studio, I am assuming it will run much faster.
Sometimes the benefits of attending a conference aren’t so much the specific sessions you attend as the ideas they spark. One example was at the Western Users of SAS Software conference last week. I was sitting in a session on PROC PHREG and the presenter was talking about analyzing the covariance matrix when it hit me –
Earlier in the week, Rebecca Ottesen (from Cal Poly) and I had been discussing the limitations of directory size with SAS Studio. You can only have 2 GB of data in a course directory. Well, that’s not very big data, now, is it?
It’s a very reasonable limit for SAS to impose. They can’t go around hosting terabytes of data for each course.
If you, the professor, have a regular SAS license, which many professors do, you can create a covariance matrix for your students to analyze. Even if you include 500 variables, that’s going to be a pretty tiny dataset but it has the data you would need for a lot of analyses – factor analysis, structural equation models, regression.
Creating a covariance data set is a piece of cake. Just do this:
proc corr data=sashelp.heart cov outp=mydata.test2 ;
var ageatdeath ageatstart ageCHDdiag ;
The COV option requests the covariances and the OUTP option has those written to a SAS data set.
If you don’t have access to a high performance computer and have to run the analysis on your desktop, you are going to be somewhat limited, but far less than just using SAS Studio.
So — create a covariance matrix and have them analyze that. Pretty obvious and I don’t know why I haven’t been doing it all along.
What about means, frequencies and chi-square and all that, though?
Well, really, the output from a PROC FREQ can condense your data down dramatically. Say I have 10,000,000 people and I want age at death, blood pressure status, cholesterol status, cause of death and smoking status. I can create an output data set like this. (Not that the heart data set has 10,000,000 records but you get the idea.)
Proc freq data= sashelp.heart ;
*Smoking /noprint out=mydata.test1;
This creates a data set with a count variable, which you can use in your WEIGHT statement in just about any procedure, like
proc means data = test1 ;
weight count ;
var ageatdeath ;
Really, you can create “cubes” and analyze your big data on SAS Studio that way.
Yeah, obvious, I know, but I hadn’t been doing it with my students.
I’m just heading off to the Western Users of SAS Software meeting that starts tomorrow. After the keynote, during which I have promised not to swear even once, I’m doing a SAS Essentials talk on Thursday, where I teach students 10 basic steps that allow them to complete an entire annual report project.
One of these is PROC DATASETS. It is used twice in the project. First, they get a list of all of the datasets in the directory. We’re using SAS Studio which runs on the SAS server. Since students neither have access to issue Unix commands directly nor do they know any most likely, we use PROC DATASETS.
libname mydata "/courses/u_mine.edu1/i_1234/c_7890/wuss14/";
proc datasets library= mydata ;
This gives me the output below.
|#||Name||Member Type||File Size||Last Modified|
Once we have cleaned up the data in every data set, we are not quite ready to start merging them together. A common problem is that data sets have different names, lengths or types for the same variable. You’d be wise to check the variable names, types and lengths of all the variables. So, here is where we use PROC DATASETS a second time.
proc datasets library= work ;
contents data = _all_ ;
This time, we added another statement. The “contents data = _all_ “ will print the contents of all of the data sets. In perusing the contents, I see that grade is entered as character data in one – 5th, 4th and so on, while it is numeric data in another. This is the sort of thing you never run into in “back of the textbook” data, but that shows up often in real life.
Those are two super simple steps that allow you to do useful things.
You can do more with PROC DATASETS – append, compare – but my plane is boarding so more about that some other time.
Lately, I’ve been working on a report that uses eight datasets that all have the same problems with the usernames.
In addition to needing to remove every username that contained the word “test” or “intern” we also needed to delete specific names of the classroom teachers who had played the game. We needed to correct names that were misspelled.
Here are a few examples of a very long list of statements:
if username in("MSDMARIA","1ANSWERKEY","MSDELAPAZ","MSCARRINGTON") then delete ;
if username = “GRETBUFFALO” then username = “GREYBUFFALO” ;
else if username = “HALFHORES” then username = “HALFHORSE” ;
else if username =”TTCARLSON18TTCARLSON18″ then username = “TTCARLSON18″ ;
These problems occurred in every dataset.
A second problem found when looking at the contents of each of the 8 datasets was that the username variable was not the same length in all of them, which could cause problems later when they were to be merged together or concatenated. Also, now that all of the usernames have been cleaned up, none should be over 12 characters in length.
Wouldn’t it be nice if there was a way to just get the first n characters of a string?
Enter our character function, substr, which returns a substring of a variable beginning at any position and for as many characters as you like. Problem solved.
newid = substr(username, 1, 12) ;
It seems pretty inefficient to write this set of statements eight times in eight different data sets. Also, next year we will have another eight data sets, and some will have these same students’ usernames and same problems. Wouldn’t it be a lot easier to have these statements in one place and add to the “fixnames.sas” file whenever we find a new problem?
So, now we have the write once, use anywhere solution of %INCLUDE.
What %INCLUDE does
The %INCLUDE statement references lines from an external file and processes them immediately. It has almost the exact same effect as if you had copied and pasted those lines write into your program. The exception that makes it “almost” is that a %INCLUDE statement must begin at a statement boundary. That is, it has to be either the first statement in your program or occur after a semi-colon ending a statement as in this example.
data studentsf ;
infile inf delimiter = “,” missover ;
attrib teacher length = $12. username length = $ 16. ;
input username $ age sex $ grade school $ teacher $ ;
%include “/courses/u_mine.edu1/wuss14/fixnames.sas” ;
Also, you need to think about it as if you had copied and pasted those lines into your program. Is it still valid code? Whenever using %INCLUDE, you should make sure the code runs in your program as expected, with no errors, before cutting it out and making it an external file.
To source or not to source
The default is not to show the statements that were in the included file. Generally, this is desirable. This is code you have already debugged and if you are using it multiple times (otherwise, why bother with the %INCLUDE), having the same 20 lines repeated 8 times in your log just makes it harder to debug.
Professors might want to use real data but hide all of the messy data handling from the students initially in fear they would run screaming for the door. I meant, professors might want to gradually introduce SAS statements and functions for data handling.
In either case, students could use the %INCLUDE statement as shown in the example above. To see the code include in your log is quite simple, just add a source2 option as shown.
%include “/courses/u_mine.edu1/wuss14/fixnames.sas” /source2 ;
It will be in your log as follows
NOTE: %INCLUDE (level 1) file “/courses/u_mine.edu1/wuss14/fixnames.sas is file “/courses/u_mine.edu1/wuss14/fixnames.sas.
419 +username = compress(upcase(username),”. “) ;
420 +if (index(username,”TEST”) > 0 or index(username,”INTERN”) > 0
and so on.
The + signs next to each statement denote it was in the included file.
If you want to know why I think it is so important for new SAS users to learn about the %INCLUDE statement, you should come to the Western Users of SAS Software conference in San Jose next month. Especially if you’re a student, you should come, because they cut you a really good deal.
If you’re not a student and you have a real professional job – well, then, you should be able to afford it. There will be funny hats, beer, coding and cookies. What more could one ask?
A few years ago, I was at the Western Uses of SAS Software conference and renowned statistician Stanley Azen played the piano and sang at the closing ceremony.
Briefly, very briefly, I considered beginning my presentation on 10 SAS steps to an annual report, by writing a song, These are a few of my favorite PROCs, and then singing it to the tune of “These are a few of my favorite things.”
This plan was dismissed a nanosecond later when I was reminded by The Perfect Jennifer that my singing bears an uncanny resemblance to the sound Beijing the cat used to make in the middle of the night when fighting with the cat next door.
To make up for my disappointment over my lack of musical rendition, I decided to do a few posts on my favorite PROCS, in no particular order. Today’s contestant is … drum roll please ….
Whenever possible, I try reading in the data using the IMPORT procedure, because, it is very simple and my goal in programming is not impress people with my brilliance – it is to get the job done with maximum efficiency and minimum effort.
As can be seen in the example below, there is no need to declare variable lengths, type or names. Only three statements are required.
PROC IMPORT OUT= work.studentsf DATAFILE= "/courses/u_mine.edu1/wuss14/Fish_students.csv" DBMS = CSV REPLACE;
GETNAMES = YES ;
This PROC IMPORT statement gives the location of the data file, specifies that its format is csv (comma separate values), the output file name is studentsf, in the work directory and that if the file specified in the OUT= option already exists, I want it to be replaced.
The second statement will cause SAS to get the variable names from the first row in the file. Since the variable names are in the first row of the file, the data begins in row 2.
Limitations of PROC IMPORT
As handy as it can be, PROC IMPORT has its limitations. Three we ran into in this project are:
- Excel files cannot be uploaded via FTP to the SAS server, so , no PROC IMPORT with Excel if you are using the SAS Web Editor,
- If the data that you want to import is a type that SAS does not support, PROC IMPORT attempts to convert the data, but that does not always work.
- For delimited files, the first 20 rows are used to determine the variable attributes. You can give a higher value for the number of rows scanned using the GUESSINGROWS statement, but you may have no idea what that higher value should be. For example, the first 300 rows may all have numbers and then the class that was records 301-324 has entered their grade as “4th” instead of the number 4.
Although PROC IMPORT is the first thing I always try, one of my pet peeves about instructors and textbooks is when it is the only thing they teach. It’s smart to try the simplest solution first. It’s dumb not to have a back up plan for the instances when that doesn’t work.
For more on what to do in those cases, you can come to WUSS in San Jose. Just a reminder – regular registration closes August 25. After that date, you’ll have to register on site.
Finishing up my second paper for WUSS next month and I have been thinking about the usefulness of character functions in a world where it sometimes seems like everyone is just put on this earth to irritate the hell out of me.
Take this problem, for example,
In analyzing the data for our games, we have all sorts of beta testers – teachers, staff, interns – who played the game but their data should be deleted from the datasets for the annual report. We asked them to use the word TEST in their username so it would be easy to pull them from the data. Some of them did and some apparently feel that I just say these things of exercise for my mouth.
There is also a problem with data entry errors. The subjects in this study were children in grades three through six and they frequently mistyped their usernames.
SAS has a wealth of character functions and this is a first opportunity to get to know and love four of them.
The UPCASE function, not surprisingly, changes the value of a variable to upper case. The COMPRESS function, if you give it only the variable as an argument, will remove blanks from a value. You can, however, include additional characters to remove. Since many of the students entered their names on some days as JohnDoe and others as John.Doe , we are removing both blanks and periods using the COMPRESS function, after we have converted them to upper case.
username = COMPRESS(UPCASE(username),'. ') ;
Then there is the INDEX function. Here is a general tip. Any time you find yourself thinking,
“Gee it would be nice if SAS did thing X”,
it is a pretty good bet that someone else thought the same idea and there is a function for it. The INDEX function is a perfect example of that. Our testers played the games many, many times and used usernames like “tester1”, “this.test”, “skippy the tester” or “intern7”.
“Wouldn’t it be nice if there was way to find out whether a given string appeared anywhere in a value?”
Enter the INDEX function, which does exactly that. This function is case-sensitive, but since we already converted the username to upper case above, that is no problem for us.
IF INDEX(username, “TEST”) > 0 or INDEX(username,”INTERN”) > 0 THEN DELETE ;
will do exactly what we want. The INDEX function returns a number that is the starting position in the string of the substring we are trying to find. So, in “skippy the tester”, the value is 12, in “tester1” it is 1. If the string is not found, the value is 0.
A problem I found when looking at the contents of each of the 8 datasets used for my research project was that the username variable was not the same length in all of them, which could cause problems later when they were to be merged together or concatenated. All of the usernames should have been a maximum of 12 characters but there were data entry problems when students would type mister_rogers instead of mr_rogers.
When the data are read in using PROC IMPORT, “For delimited files, the first 20 rows are scanned to determine the variable attributes. You can increase the number of rows scanned by using the GUESSINGROWS data source statement.”
Wouldn’t it be nice if there was a way to just get the first n characters of a string?
newid = SUBSTR(username, 1, 12) ;
will create a new variable with the first 12 characters of the username, now that we have gone and fixed the problems with it.
SAS is chock full of functions and options to make your life easier. If you are just beginning to work with SAS and you spend time working with messy data, you probably couldn’t spend your time much better than taking a few hours to read up on SAS character functions. In fact, I think for someone new to SAS, becoming familiar with a large number of all types of functions – character, statistical, date and time – is probably the fastest way to improve one’s productivity. (Ron Cody’s book, SAS Functions by Example, is a great resource). I’ve lost count of the number of times when reviewing a student’s program I’ve seen many lines of completely unnecessary code that could have been replaced by a SAS function – if the student only knew that it existed.
The second time I taught statistics, I supplemented the textbook with assignments using real data, and I have been doing it in the twenty-eight years since. The benefits seem so obvious to me that it’s hard to believe that everyone doesn’t do the same. The only explanation I can imagine is that they are not very good instructors or not very confident. You see, the problem with real data is you cannot predict exactly what the problems will be or what you will learn.
For example, the data I was planning on using for an upcoming class came from 8 tables from two different MySQL databases. Four datasets had been read into SAS in the prior year’s analysis and now four new files, exported as csv files were going to be read in.
Easy enough, right? This requires some SET statements and a PROC IMPORT, a MERGE statement and we’re good to go. What could go wrong?
Any time you find yourself asking that question you should do the mad scientist laugh like this – moo wha ha ha .
Here are some things that went wrong -
The PROC IMPORT did not work for some of the datasets. No problem, I replaced that with an INFILE statement and INPUT statement. It’s all good. They learned about FILENAME and file references and how to code an INPUT statement. Of course, being actual data, not all of the variables had the same length or type in every data set, so they learned about an ATTRIB statement to set attributes.
Reading in one data set just would not work, it has some special characters in it, like an obelus (which is the name for the divide symbol – ÷ now you know). Thanks to Bob Hull and Robert Howard’s PharmaSUG paper, I found the answer.
DATA sl_pre ;
SET mydata.pretest (ENCODING='ASCIIANY');
Every data set had some of the same problems – usernames with data entry errors that were then counted as another user, data from testers mixed in with the subjects. The logical solution was a %INCLUDE of the code to fix this.
In some data sets the grade variable was numeric and in others it was ‘numeric-ish’. I’m copywriting that term, by the way. We’ve all seen numeric-ish data. Grade is supposed to be a number and in 95% of the cases it is but in those other 5% they entered something like 3rd or 5th. The solution is here:
nugrade=compress(upcase(grade),'ABCDEFGHIJKLMNOPQRSTUVWXYZ ') + 0 ;
and then here
Data allstudentsents ;
set test1 ( rename =(nugrade= grade)) test2 ;
This gives me an opportunity to discuss two functions – COMPRESS and UPCASE, along with data set options in the SET statement.
I do start every class with back-of-the-book data because it is an easy introduction and since many students are anxious about statistics, it’s good to start with something simple where everyone can succeed. By the second week, though, we are into real life.
Not everyone teaches with real data because, I think, there are too many adjunct faculty members who get assigned a course the week before it starts and don’t have time to prepare. (I simply won’t teach a course on short notice.) There are too many faculty members who are teaching courses they don’t know well and reading the chapter a week ahead of the students.
Teaching with real, messy data isn’t easy, quick or predictable – which makes it perfect for showing students how statistics and programming really work.
I’m giving a paper on this at WUSS 14 in San Jose in September. If you haven’t registered for the conference, it’s not too late. I’ll post the code examples here this week so if you don’t go you can be depressed about what you are missing,