I’ve been pretty pleased with SAS Studio (the product formerly known as SAS Web Editor), so when Jodi sent me an email with information about using a virtual machine for the multivariate statistics course, I was a bit skeptical. Every time I’ve had to use a remote desktop connection to a virtual machine for SAS it has been painfully slow. I’ve done it several times, probably in 2001, 2003 and 2008, when I was at sites that tried, and generally failed, to use SAS on virtual machines.
Your mileage may vary, and here is the danger of testing on a development machine – I have the second-best computer in the office. I have 16 GB of RAM and a 3.5 GHz Intel Core i7 processor. Everything from available space (175 GB) to download speed (27 Mbps) is probably better than the average student will have.
On the previous occasions when I used SAS on a remote virtual machine I had pretty good computers, too, for the time, but 6 to 13 years makes for a pretty dramatic difference in technology.
That being said, the virtual machine offered levels of coolness not available with SAS Studio.
Firstly, size. I did a factor analysis with 206 variables and 51,000 observations because I’m weird like that. I wanted to see what would happen. It extracted 49 factors and performed a varimax rotation in 16.49 seconds. I don’t believe SAS Studio was created with this size of data set in mind.
Secondly, size again. The data sets on the virtual machine added up to several times more than the allowable space for a course directory in SAS On-Demand.
Thirdly, it looked exactly like SAS because it was.
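For anyone who wants to try the same kind of run, here is a minimal sketch of a factor analysis with a varimax rotation. The 206-variable data set is not available here, so this uses the SASHELP.HEART data as a stand-in, and the variable list and number of factors are arbitrary choices for illustration, not the settings used in the test described above.

```sas
/* Sketch only: SASHELP.HEART stands in for the much larger data set. */
/* The variable list and NFACTORS= value are illustrative assumptions. */
proc factor data=sashelp.heart
            method=principal      /* principal-component extraction (the default) */
            rotate=varimax        /* orthogonal varimax rotation */
            nfactors=2 ;          /* arbitrary number of factors to retain */
  var ageatstart height weight diastolic systolic smoking cholesterol ;
run ;
```

Timing the same step on the virtual machine and on your own desktop (the SAS log reports real time and CPU time for each step) is an easy way to compare the two environments.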
Now, I do realize that the virtual machine with SAS is probably only allowable if your university has a site wide license from SAS.
SAS Studio retains the significant advantage of being free and easy. It also seems to have morphed overnight. I don’t remember these tasks being on the left side, and while they look interesting and useful, they do NOT:
- Encompass all of the statistics students need to compute in my classes, e.g., population attributable risk.
- Explain where the heck my programs went that I wrote previously. I can still create a new program and save a program and it even shows the folders I had previously as choices to save the new program.
The first problem is easily taken care of if I can just find out where the programs are saved: for statistics not available in the task selections, students can just write a program. I’ll look into that this weekend since I have had to get up THREE days this week before 9 a.m. I am thinking I need to get some sleep.
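As an example of the kind of short program students could write when a statistic is not in the task list, here is a sketch that computes population attributable risk from a 2x2 table. The counts are made up purely for illustration, and the formula used is the common textbook one (risk in the total population minus risk in the unexposed group).

```sas
/* Hypothetical counts, for illustration only */
data _null_ ;
  exposed_cases   = 60 ;   exposed_total   = 400 ;
  unexposed_cases = 40 ;   unexposed_total = 600 ;

  /* risk in the whole population and in the unexposed group */
  risk_total     = (exposed_cases + unexposed_cases) / (exposed_total + unexposed_total) ;
  risk_unexposed = unexposed_cases / unexposed_total ;

  /* population attributable risk, and the same quantity as a share of total risk */
  par         = risk_total - risk_unexposed ;
  par_percent = par / risk_total ;

  put par= 6.4 par_percent= percent8.1 ;   /* results are written to the log */
run ;
```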
From my initial takes of the latest versions of each, I think I will:
- Use SAS Studio for my biostatistics course because it is an easy, basic introduction AND, once I figure out where the programs are hidden, I can have students write some simple programs. (It may be in an obvious place but sleep deprivation does strange things to your brain.)
- Use the virtual machine for multivariate statistics because it allows for larger data sets and, although I did not have a similar size data set in SAS Studio, I am assuming it will run much faster.
The new Common Core standards have statistics first taught in the sixth grade, or so they say. I disagree with this statement because as I see it, much of the basis of statistics is taught in the earlier grades, although not called by that name. Here are just a few examples:
- Bar graphs
- Line plots
- X,Y coordinates
- Fractions and decimals (since the mean is rarely going to be an integer)
- Ratios and proportions – in summarizing a data set, it’s pretty common to point out, for example, that the ratio of game fish to non-game fish was 3:2. We are often asking if the percentage of something observed is disproportionate to the percentage in the population.
It doesn’t bother me that these topics are not called statistics; I’m just pointing it out. Whether a line is considered a regression line or simply points in two-dimensional space is a matter of context and nothing else.
Speaking of lines and graphs, the very basis of describing a distribution starts with graphing it. So, those second grade bar graphs? Precursors to sixth grade requirements to summarize and describe a set of data.
You might say I’m going to an extreme including fractions in there because I may as well throw in addition and division. After all, you need to add up the individual scores and divide by N. Actually, I wouldn’t argue too much with that view.
You can’t even compute a standard deviation without understanding the concepts of squares and square roots, so it would be easy to argue that is at least a prerequisite to statistics.
While I’ve heard a lot of people hating on the Common Core, personally, I’m interested in seeing how it plays out.
What I expect will continue to happen is that many children will be turned off of math by the third grade because it is generally taught SO abysmally. That isn’t all the fault of teachers – the books they are given to use are often deathly boring. This isn’t to say I am not bothered by the situation. It bothers me a lot.
Working mostly in low-performing schools, I see students who are not very proficient with fractions, proportions, exponents or mathematical notation. We are trying to design our games to teach all of those prerequisites and then start showing students different distributions, having them collect and interpret data.
Lacking prerequisites is one of the three biggest barriers I see in teaching statistics, or any math, to students. The other two are related: low expectations for what students should be able to learn at each grade, and the fiction on the part of teachers and students that everything should be easy.
People were all up in arms years ago because there was a Barbie doll that said, “Math is hard.”
Guess what? Math is hard sometimes and that is why you have to work hard at it. Even if you really like math and do well at it in school, even if it’s your profession, there are times when you have to spend hours studying it and figuring something out.
Today, I was reviewing textbooks for a course I’ll be teaching on multivariate statistics. I didn’t like any of the three I read for the course, although I found one of them pretty interesting just from a personal perspective. The one I liked had page after page of equations in matrix algebra and it would be a definite stretch for most master’s students. I’m really debating using it because I know, just like with the middle school students, there will be many lacking prerequisites and it will take a LOT of work on my part to explain vectors and determinants before we can even get to what they are supposed to be learning.
Last week, I had someone seriously ask me if we could make our games “look less like math so that students are learning it without realizing it”. No, we cannot. There’s nothing so wrong with learning math that you need to disguise it to look like something else.
Whenever I catch myself thinking, while designing a game, “Will the students be able to do X?” and I think they will not because they are lacking the prerequisites, I build an earlier level to teach the prerequisites and go ahead and include X anyway.
Here is why — I’m sitting at the other end teaching graduate students where the text begins like this:
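“Root mean square residual (RMR): For a single-group analysis, the RMR is the root of the mean squared residuals:
$$\mathrm{RMR} = \sqrt{\frac{1}{b}\left[\sum_{i=1}^{p}\sum_{j=1}^{i}\left(s_{ij}-\hat{\sigma}_{ij}\right)^{2} + \delta\sum_{i=1}^{p}\left(\bar{x}_{i}-\hat{\mu}_{i}\right)^{2}\right]}$$
where $b = p(p+1+2\delta)/2$ is the number of distinct elements in the covariance matrix and in the mean vector (if modeled).
For multiple-group analysis, PROC CALIS uses the following formula for the overall RMR: …”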
“The probability distribution (density) of a vector y denoted by f(y) is the same as the joint probability distribution of y1, …, yp.”
“It is easy to verify that the correlation coefficient matrix, R, is a symmetric positive definite matrix in which all of the diagonal elements are unity.”
Sometimes the benefits of attending a conference aren’t so much the specific sessions you attend as the ideas they spark. One example was at the Western Users of SAS Software conference last week. I was sitting in a session on PROC PHREG and the presenter was talking about analyzing the covariance matrix when it hit me –
Earlier in the week, Rebecca Ottesen (from Cal Poly) and I had been discussing the limitations of directory size with SAS Studio. You can only have 2 GB of data in a course directory. Well, that’s not very big data, now, is it?
It’s a very reasonable limit for SAS to impose. They can’t go around hosting terabytes of data for each course.
If you, the professor, have a regular SAS license, which many professors do, you can create a covariance matrix for your students to analyze. Even if you include 500 variables, that’s going to be a pretty tiny dataset but it has the data you would need for a lot of analyses – factor analysis, structural equation models, regression.
Creating a covariance data set is a piece of cake. Just do this:
proc corr data=sashelp.heart cov outp=mydata.test2 ;
var ageatdeath ageatstart ageCHDdiag ;
run ;
The COV option requests the covariances and the OUTP option has those written to a SAS data set.
If you don’t have access to a high performance computer and have to run the analysis on your desktop, you are going to be somewhat limited, but far less than just using SAS Studio.
So — create a covariance matrix and have them analyze that. Pretty obvious and I don’t know why I haven’t been doing it all along.
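To make the idea concrete, here is a rough sketch of both halves: the professor builds the summary data set, and the students analyze it without ever touching the raw records. The WORK library, the NOMISS option (listwise deletion, so every entry is based on the same cases) and the choice of regression model are my assumptions for the sake of a self-contained example.

```sas
/* Professor's side: write covariances, means, SDs and N to a small data set. */
/* NOMISS and the WORK library are choices made for this sketch.              */
proc corr data=sashelp.heart cov nomiss noprint outp=work.heartcov ;
  var ageatdeath ageatstart agechddiag ;
run ;

/* Student's side: PROC REG can read the TYPE=CORR data set directly */
proc reg data=work.heartcov ;
  model ageatdeath = ageatstart agechddiag ;
run ;
quit ;
```

Only a handful of rows are uploaded to the course directory no matter how many records were in the original data.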
What about means, frequencies and chi-square and all that, though?
Well, really, the output from a PROC FREQ can condense your data down dramatically. Say I have 10,000,000 people and I want age at death, blood pressure status, cholesterol status, cause of death and smoking status. I can create an output data set like this. (Not that the heart data set has 10,000,000 records but you get the idea.)
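proc freq data=sashelp.heart ;
* one cell per combination of the five variables, written to a data set instead of printed ;
tables AgeAtDeath*BP_Status*Chol_Status*DeathCause*Smoking / noprint out=mydata.test1 ;
run ;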
This creates a data set with a count variable, which you can use in your WEIGHT statement in just about any procedure, like
proc means data = mydata.test1 ;
weight count ;
var ageatdeath ;
run ;
Really, you can create “cubes” and analyze your big data on SAS Studio that way.
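One detail worth checking when you hand students a condensed data set like this: in PROC MEANS, a FREQ statement treats each row as COUNT identical observations, so N and the standard deviation match the raw data, while a WEIGHT statement reproduces the mean but not necessarily N or the standard deviation. The sketch below, using a smaller two-variable cube of my own choosing, lets you compare the condensed and raw results side by side.

```sas
/* Condense: one row per age-at-death by sex combination, COUNT = cell size */
proc freq data=sashelp.heart noprint ;
  tables ageatdeath*sex / out=work.cube ;
run ;

/* Analyze the cube: FREQ expands each row to COUNT observations */
proc means data=work.cube n mean std ;
  freq count ;
  var ageatdeath ;
run ;

/* Same statistics from the raw data, for comparison */
proc means data=sashelp.heart n mean std ;
  var ageatdeath ;
run ;
```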
Yeah, obvious, I know, but I hadn’t been doing it with my students.
The second time I taught statistics, I supplemented the textbook with assignments using real data, and I have been doing it in the twenty-eight years since. The benefits seem so obvious to me that it’s hard to believe that everyone doesn’t do the same. The only explanation I can imagine is that they are not very good instructors or not very confident. You see, the problem with real data is you cannot predict exactly what the problems will be or what you will learn.
For example, the data I was planning on using for an upcoming class came from 8 tables from two different MySQL databases. Four datasets had been read into SAS in the prior year’s analysis and now four new files, exported as CSV files, were going to be read in.
Easy enough, right? This requires some SET statements and a PROC IMPORT, a MERGE statement and we’re good to go. What could go wrong?
Any time you find yourself asking that question you should do the mad scientist laugh like this – moo wha ha ha .
Here are some things that went wrong -
The PROC IMPORT did not work for some of the datasets. No problem, I replaced that with an INFILE statement and INPUT statement. It’s all good. They learned about FILENAME and file references and how to code an INPUT statement. Of course, being actual data, not all of the variables had the same length or type in every data set, so they learned about an ATTRIB statement to set attributes.
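Here is a minimal sketch of that FILENAME / INFILE / INPUT / ATTRIB combination. The file contents, variable names and lengths are invented for illustration (the real files are not shown here), and a temporary file stands in for the exported CSV so the example runs on its own.

```sas
/* Stand-in for a real exported CSV file (contents are hypothetical) */
filename pre temp ;
data _null_ ;
  file pre ;
  put "username,grade,score" ;
  put "alice,3rd,12" ;
  put "bob,5,9" ;
run ;

/* Read it with explicit attributes so every data set gets the same types and lengths */
data work.pretest ;
  attrib username length=$32 label="Student username"
         grade    length=$8  label="Grade as entered"
         score    length=8   label="Pretest score" ;
  infile pre dsd firstobs=2 truncover ;   /* comma-delimited, skip the header row */
  input username $ grade $ score ;
run ;
```

Declaring the attributes before the INPUT statement is what keeps lengths and types consistent when the data sets are later combined.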
Reading in one data set just would not work; it had some special characters in it, like an obelus (which is the name for the divide symbol, ÷, now you know). Thanks to Bob Hull and Robert Howard’s PharmaSUG paper, I found the answer.
DATA sl_pre ;
SET mydata.pretest (ENCODING='ASCIIANY');
RUN ;
Every data set had some of the same problems – usernames with data entry errors that were then counted as another user, data from testers mixed in with the subjects. The logical solution was a %INCLUDE of the code to fix this.
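A sketch of how the %INCLUDE approach can work: the cleaning statements live in one file and are pulled into each DATA step that needs them. The usernames and cleaning rules below are made up, and a temporary file stands in for the shared program file so the example is self-contained.

```sas
/* Write the shared cleaning statements to a file (hypothetical rules) */
filename fixes temp ;
data _null_ ;
  file fixes ;
  put "username = lowcase(strip(username)) ;" ;
  put "if username in ('tester1', 'test') then delete ;" ;
run ;

/* Toy data with a case error and a tester record */
data work.raw ;
  length username $ 20 ;
  input username $ score ;
  datalines ;
Alice 10
alice 12
tester1 99
;
run ;

/* Every data set gets the same fixes from the same file */
data work.clean ;
  set work.raw ;
  %include fixes ;
run ;
```

When a new data entry error turns up, it gets fixed in one place and every data set picks up the change on the next run.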
In some data sets the grade variable was numeric and in others it was ‘numeric-ish’. I’m copyrighting that term, by the way. We’ve all seen numeric-ish data. Grade is supposed to be a number and in 95% of the cases it is but in those other 5% they entered something like 3rd or 5th. The solution is here:
nugrade=compress(upcase(grade),'ABCDEFGHIJKLMNOPQRSTUVWXYZ ') + 0 ;
and then here
Data allstudentsents ;
set test1 ( rename =(nugrade= grade)) test2 ;
run ;
This gives me an opportunity to discuss two functions – COMPRESS and UPCASE, along with data set options in the SET statement.
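Put together as something students can run from start to finish, the numeric-ish fix might look like the sketch below. The toy data and data set names are mine, and I have assumed the original character GRADE needs to be dropped before the rename so the two names do not collide.

```sas
/* Toy data: grade entered as text, sometimes with suffixes */
data work.test1 ;
  length grade $ 8 ;
  input grade $ ;
  /* remove letters and blanks, then + 0 forces a character-to-numeric conversion */
  nugrade = compress(upcase(grade), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ ') + 0 ;
  datalines ;
3rd
5th
4
;
run ;

/* Toy data: grade already numeric */
data work.test2 ;
  input grade ;
  datalines ;
5
;
run ;

/* Drop the character version, rename the numeric one, then stack the two */
data work.allstudents ;
  set work.test1 (drop=grade rename=(nugrade=grade)) work.test2 ;
run ;
```

The `+ 0` trick writes a conversion note to the log; `input(compress(grade, , 'kd'), 8.)` is an alternative that keeps only the digits and converts explicitly.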
I do start every class with back-of-the-book data because it is an easy introduction and since many students are anxious about statistics, it’s good to start with something simple where everyone can succeed. By the second week, though, we are into real life.
Not everyone teaches with real data because, I think, there are too many adjunct faculty members who get assigned a course the week before it starts and don’t have time to prepare. (I simply won’t teach a course on short notice.) There are too many faculty members who are teaching courses they don’t know well and reading the chapter a week ahead of the students.
Teaching with real, messy data isn’t easy, quick or predictable – which makes it perfect for showing students how statistics and programming really work.
I’m giving a paper on this at WUSS 14 in San Jose in September. If you haven’t registered for the conference, it’s not too late. I’ll post the code examples here this week so if you don’t go you can be depressed about what you are missing.
Being the word chooser of this blog, I have decided that visual literacy means the ability to “read” graphic information. A post I saw today on Facebook earnings over time gave a prime example of this.
If you are a fluent “visualizer”, then just like a fluent reader can read a paragraph and comprehend it, summarize the main points and rephrase it, you could easily grasp the chart above. You would say:
- Over two years, the number of users from the U.S. & Canada has grown relatively little.
- The U.S./Canadian market had the lowest number of users for the past two years.
- Europe was the next smallest market and grew about 20% over two years.
- Asia was the second-largest “market”, second only to “the rest of the world”
- The U.S./Canada and European markets are shrinking as a percentage of Facebook users.
My point isn’t anything about Facebook or Facebook users. I don’t really care. What I do want to point out is that if you are reading this blog, you probably found all of those points so obvious that you wonder why I am even mentioning them. Of course, you are reading this blog, so no one needs to explain what those black letters on the screen mean, either.
The need for visual literacy is all around you – and that’s my real point.
When we started the Dakota Learning Project to evaluate our educational games, I wondered if we had bitten off more than we could chew. We proposed to develop the games, pilot them in schools, collect data and analyze the data to see if the games had any impact. We were also going to go back and revise the games based on feedback from the students and teachers.
Some people told us this was far too much and we should just do a qualitative study observing the students playing the game and having them “think aloud”. Another competition we applied to for funding turned us down and one of the reasons they gave is that we were proposing too much.
We ended up doing a mixed methods design, collecting both qualitative and quantitative data and I’m very glad I did not listen to any of these people telling me that it was too much.
There is no substitute for statistics.
When I observed the students in the labs, I thought that perhaps the grade level assigned to specific problems was inconsistent with what the students could really do. For example:
Add and subtract within 1000 … is at the second-grade level
Multiply one-digit numbers … is at the third-grade level
It seemed to me that students were having a harder time with the supposedly second-grade problem, but I wasn’t sure if that was really true. Maybe I was seeing the same students miss it over and over. After all, we had 591 students play Spirit Lake in this round of beta testing. It was certainly possible I saw the same students more than once. It is definitely the case that students who were frustrated and just could not get a problem stuck in my mind.
So …. I went back to the data. These data do double-duty because I’m teaching a statistics class this fall and I am a HUGE advocate of graduate students getting their hands on real data, and here was some actual real data to hand them. (I always analyze the data in advance so it is easy to grade the students’ papers, to give examples in class and so I don’t get students complaining that I am trying to get them to do my work for me, although they still do. Ha! As if.)
We had 1,940 problems answered so, obviously, students answered more than one problem each. Of those problems, 1,053, or 54.3%, were answered correctly on the first attempt. This made me quite happy because it is close to an ideal item difficulty level. Too easy and students get bored. Too hard and they get frustrated.
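For students who want to compute item difficulty themselves, the idea is simply the mean of a 0/1 "correct on first try" variable for each problem. The sketch below uses a tiny invented data set with hypothetical variable and problem names, not the actual game data.

```sas
/* Toy data, invented for illustration: one row per student per problem */
data work.attempts ;
  input student problem $ first_try_correct ;
  datalines ;
1 sub599 1
2 sub599 0
3 sub599 1
1 mult1 1
3 mult1 1
;
run ;

/* The mean of a 0/1 variable is the proportion correct, i.e. item difficulty */
proc means data=work.attempts n mean maxdec=2 ;
  class problem ;
  var first_try_correct ;
run ;
```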
I used SAS Enterprise Guide to produce the chart below:
You can see that the subtraction problem showed up about mid-range in difficulty. Now, it should be noted that the group gets more selective as you move along. That is, you don’t get to the multiplication problems unless you passed the subtraction problem. Still, it is worth noting that only 70% of fourth- and fifth-grade students in our sample answered correctly on the first try a problem that was supposedly a second-grade question.
Because we want students to start the game succeeding, I added a simpler problem at the beginning. That’s the first bar with 100% of the students answering it correctly. I won’t get too excited about that yet, as I added it later in the study and only a few students were presented that problem. Still, it looks promising.
So, what did I learn that I couldn’t learn without statistics? Well, it reinforced my intuition that the subtraction problem was harder than the multiplication ones and told me that a substantial proportion of students were failing it on the first try. It was not the same students failing over and over.
The second question, then, was whether the instructional materials made any difference. I’m pleased to tell you that they did. On the second (or higher) attempt, 85% of the students answered correctly. If you add the .85 of the 30% who failed the first go-round (about 25.5%) to the 70% who passed on the first attempt, you get about 95% of the students continuing on in the game. This made me happy because it shows that we are beginning at an appropriate level of difficulty. I would have liked 100% but you can’t have everything.
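A quick way for students to check this kind of arithmetic is a DATA _NULL_ step that writes the results to the log. This sketch uses only the rounded figures quoted in the text, so the underlying counts could give slightly different values.

```sas
data _null_ ;
  /* share of problems answered correctly on the first attempt: about 54.3% */
  pct_first = 1053 / 1940 ;

  /* pass on first try, plus those who failed first and then passed: about 95.5% */
  pct_continue = 0.70 + 0.85 * 0.30 ;

  put pct_first= percent8.1 pct_continue= percent8.1 ;
run ;
```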
I should note that the questions are NOT multiple choice, and in fact, the answer to that particular problem is 599, so it is not likely the student would have just guessed it on the second attempt.
More notes from the text mining class. …
This is the article I mentioned in the last post, on Singular Value Decomposition
Contrary to expectations, I did find time to read it, on the ride back from Las Vegas and it is surprisingly accessible even to people who don’t have a graduate degree in statistics, so I am going to include it in the optional reading for my course.
Many of these concepts, like start and stop lists, apply to any text mining software, but it just happens that the class I’m teaching this fall uses SAS.
In Enterprise Miner, you can only have 1 project open at a time, but you can have multiple diagrams and libraries, and of course, zillions of nodes, in a single project.
In Enterprise Miner, you can use text or text location as a type of variable. Documents smaller than 32K can be contained in the project as a text variable. For anything greater than 32K, give a text location.
- start lists – often used for technical terms
- stop lists, e.g. articles like “the”, pronouns. These appear with such frequency in documents that they don’t contribute to our goal, which is to distinguish between documents. May also include words that are high frequency in your particular data. For example, mathematics, in our data, because it is in almost every document we are analyzing.
Multi-word term tables – standard deviation is a multi-word term
Importing a dictionary — go to properties. Click the …. next to the dictionary (start or stop) you want to import. When it comes up with a window, click IMPORT
Select the SAS library you want. Then select the data set you want. If you don’t find the library that you want, try this:
- Close your project.
- Open it again
- Click on the 3 dots next to PROJECT START CODE in the property window
- Write a LIBNAME statement that gives the directory where your dictionaries are located.
- Open your project again
[Note: Re-read that last part on start code. This applies to any time you aren't finding the library you are looking for, not just for dictionaries. You can also use start code for any SAS code you want to run at the start of a project. I can see people like myself, who are more familiar with SAS code than Enterprise Miner, using that a lot.]
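The start code itself can be as short as one LIBNAME statement. In this sketch the libref name and the directory are placeholders I made up, so substitute the folder where your dictionaries actually live.

```sas
/* Project start code: the path below is a hypothetical placeholder */
libname mydict "/path/to/your/dictionaries" ;

/* Optional check: LIBREF returns 0 when the library was assigned successfully */
%put NOTE: LIBREF return code for MYDICT = %sysfunc(libref(mydict)) ;
```

If the return code in the log is not 0, the path is probably wrong, which is a lot quicker to find out here than by hunting for the library in the import window.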
Filter viewer – can specify minimum number of documents for term inclusion
Speaking of Las Vegas, blogging has been a little slow lately since we took off to watch The Perfect Jennifer get married. It was a very small wedding, officiated by Hawaiian Elvis. Darling Daughter Number Three doubled as bartender and bridesmaid then stayed in Las Vegas because she has a world title fight in a few days.
Given the time crunch, I was particularly glad I’d attended this course that gave me the opportunity to draft at least one week’s worth of lectures in the fall. When I finish these notes, my plan is to edit them and turn them into the last lecture in the data mining course. If it’s helpful to you, feel free to use whatever you like. I’ll try to remember to post a more final version in the fall. If you have teaching resources for data mining yourself, please let me know.
My crazy schedule is the reason I start everything FAR ahead of time.
Maybe this is obvious, but I have often found that what is obvious to some people is not so obvious to others, so here are a few random tips.
1. Enterprise Miner can take a REALLY long time to load during which you wonder if anything is happening at all.
Open up the task manager and look for something that says javaw.exe *32. You can see it near the bottom in the image above. The number next to it should be going up, from 30,000 to 50,000 etc. If it is, you should probably be patient for a few more minutes and your session will start.
2. Let’s say you want to change the properties of something. For example, I don’t want the data set to be partitioned into Training, Validation and Test in a 40, 30, 30 split. I want it to be 50, 50, 0. So, I right-click on the DATA PARTITION node, get a drop-down menu and there is all of this stuff about Edit Variables all the way down to Disconnect Nodes. Where the hell are the properties to change? They’re on the left, in that window with the title Property! Funny, but it’s so easy to focus on the diagram window and completely forget about everything else. Click on a node and its properties will show up in the window.
3. While the three screens you see when you run the StatExplore node are pretty interesting, it would be nice to have a more detailed look at your data. Just go to the VIEW menu and you can get more statistics, like the cell chi-square values, descriptive statistics of numeric variables broken down by the levels of your target variable.
After all of the effort to get Enterprise Miner installed, I thought it better do something good. It is interesting to use. Unlike programming where you can get a program to run but give you errors or unexpected results, so far (key phrase!), with Enterprise Miner I have found the problem to be knowing exactly what to select, for example, with CREATE DATA sources. Once you know that, however, it seems pretty hard to make an error.
Enterprise Miner does do some pretty cool stuff, which makes it worth the pain of getting it installed. Even way cooler, unlike back in the day when no one could get their hands on it without paying approximately $4,893,0893.16 , their first born child, their left kidney and an albino goat, if you are an instructor or a student, you can get it for free through SAS On-Demand for Academics.
(And, yes, for the record, I *am* aware that said goat is not an albino. I was fresh out of pictures of albino goats. Deal with it.)
Speaking of Enterprise Miner, I thought I would ramble on about the good parts for a few posts, since I’m getting ready to teach data mining in the fall and I hate to do anything at the last minute.
One of the good parts is StatExplore. At first glance, it looks good, but at second glance, it looks better.
All you need to do is create a diagram by going to the FILE menu, then selecting NEW and then DIAGRAM.
You can start by dragging a data source on to the diagram. In this example, I used the heart data set from the Framingham Heart Study, which happens to ship with Enterprise Miner in the SASHELP library.
I drag the data set from data sources to the diagram window.
Next, I click on the EXPLORE tab just above the diagram window. This gives you a bunch of icons. Enterprise Miner is just rife with icons. Never fear, though, if you have no idea what this bunch of colored boxes is supposed to mean versus that bunch, just hover over the icon with your mouse and it will tell you.
Here is my diagram. Simple, no? It gives you a bunch of cool stuff. First, you have the plot of chi-square values for all nominal variables.
You can see that sex has the highest chi-square (as in gender, not as in frequency of), followed by cholesterol status, smoking status and weight status. I find this rather surprising. I knew women lived longer than men, but with all of the discussion of obesity, I thought weight would be higher up there.
The next chart gives me the worth of each variable in predicting my target, which in this example is death.
The variable on the far left is age at start. Not surprisingly, the older people are when you start following them, the more likely they are to die in a given period of time. The next variable is Age at CHD Diagnosis, followed by two blood pressure measures, their cholesterol, then cholesterol status – weight status is down at the end.
This analysis produces A LOT of statistics. I found this interesting because, despite some people arguing that Enterprise Miner allows analysis by someone without an extensive programming or statistics background, certainly in the case of statistics, the more knowledge you have, the better you can make use of the results.
For example, in the top right (all three of the screen shots above are one screen, I broke them up in an attempt at legibility), the output pane gives descriptive statistics broken down by each level of the target variable. I can see how many people who died had missing data for age at CHD diagnosis, skewness and kurtosis values for variables by status, living or dead, the mode for weight status for people who were living or dead, and a whole lot more. Interestingly, 68% of the whole sample was overweight.
Scrolling through the statistics output I can get a good idea of the data quality – is it skewed, is it missing, is it missing at random.
Without some background in statistics, that’s probably no more than a bunch of numbers. Personally, I found it very helpful. That’s another assignment for the students, to write a brief summary of their data, including any concerns. There weren’t any real problems with these data except for the obvious fact that variables like cholesterol and cholesterol status, or smoking and smoking status, are going to be highly correlated. It would be a good idea to include one of each pair as input in any predictive analyses and reject the other to prevent multicollinearity problems.
(NOTE to self: Make sure to explain variable roles, changing variable roles in EM and multi-collinearity.)
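For those more at home in SAS code than in Enterprise Miner, a rough equivalent of what the StatExplore node reports can be had from PROC FREQ and PROC MEANS. This is only an approximation of the node's output, not the code it runs, and the variable lists are my own selection from SASHELP.HEART.

```sas
/* Chi-square tests of each nominal variable against the target (STATUS) */
proc freq data=sashelp.heart ;
  tables (sex chol_status smoking_status weight_status bp_status) * status
         / chisq nopercent nocol ;
run ;

/* Descriptive statistics for interval variables, by level of the target */
proc means data=sashelp.heart n nmiss mean std skewness kurtosis maxdec=2 ;
  class status ;
  var ageatstart agechddiag systolic diastolic cholesterol weight ;
run ;
```

The NMISS, skewness and kurtosis columns are the ones to look at when writing up data quality concerns.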
You might think this is adequate for running just one node, but, in fact, there is much more here than meets the eye. More on that tomorrow because speaking of overweight, I have been at a computer for 13 hours today and I want to hop on the bike and get some exercise in before I knock out the last task I need to do today. Although @sammikes just pointed out on twitter that round is a shape, it is not the one I want to be in.
Most likely, you, too, have experienced homicidal urges when confronted with a problem you have spent five hours trying to solve on your computer, only to call tech support and have them report,
Well, it works fine on my computer.
You’d think if that solved the problem that they would offer to box up their computer and send it over to your house but, alas, they never do.
This is the reason that any software I use for class I test on several computers under different conditions. After having initially failed to get SAS On-Demand for Enterprise Miner to work with boot camp on the Mac, I tried it on a Lenovo machine running Windows 8. I had to install the JRE and ignore a few security warnings, but after that it worked.
[For how I did eventually get it working with boot camp, click here, and thank Jason Kellogg from SAS. ]
Next, I needed to upload some data. The SAS instructions say to use your favorite FTP client and coincidentally, I do have a favorite FTP client (Filezilla), so I downloaded it to the testing machine.
Only the professor can upload data to the class directory, and most professors probably have an FTP program on their personal computer (or maybe not, do you?) Even if you normally do, you may, like me, have borrowed a machine to use for testing or have a new computer. Whatever, this just reinforces my argument that you should never, never plan to use any kind of software in a class unless you have ample time to prepare.
I know that there are schools that ask adjuncts to teach on a week or two notice. That seems to me a recipe for disaster for both the professor and students. Unless maybe you are doing something that hasn’t changed in 50 years and requires no technology, like reading Chaucer, I recommend you follow the advice of Nancy Reagan and “Just say no.”
Here are my first few hints:
- Test the software on multiple machines and multiple operating systems.
- Make sure one of those machines is on the older, under-powered end of the spectrum, as students often don’t have a lot of extra cash and may not have the shiniest, newest machine like you have on your desk.
- Test it on the latest operating system. It may turn out that the version your school has does not work with Windows 8. (I did not have that problem with the Enterprise Miner this time, but I’ve had it with other software in the past so it is a good idea.)
- Find out what other software you might need, for example, some kind of FTP program in this case, and install it on your computer, if necessary.
- Give yourself plenty of time to do all of the above.
You might think these types of things would be handled by the information technology department at your university, and you may be really lucky and that will be so. In many schools, the IT department basically helps re-set passwords, assigns school email addresses, helps to get discounts on software and upload files to Blackboard and not much else.
For years, I have been trying to figure out where the $50,000 a year or so tuition goes. It isn’t to adjunct professors and it isn’t to the IT staff. It also isn’t to buying the latest technology because, more and more often, students are expected to bring their own device.
You may think that none of the above should be your job and you may be right, but I am just saying if you want to anticipate the frustrations your students will experience and be able to solve their problems during the lecture by directing them to a link on your class website/ blog your life and theirs will both be a lot easier.