When I was running out to the airport, I said I would explain how to get a single plot of all of your scatter plots with SAS. Let’s see if I can get this posted before the plane lands. Not likely but you never know …
It’s super simple:
proc sgscatter data=sashelp.heart ;
matrix ageatdeath ageatstart agechddiag mrw / group= sex ;
run ;
And there you go.
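If the full matrix is more than you need, a variation is to plot one outcome against each of the other variables in a single row of panels. This is only a sketch using the COMPARE statement of the same procedure; picking ageatdeath as the Y variable is my assumption about what you would want to look at.

```sas
/* Sketch: age at death against each of the other three variables, */
/* one panel per X variable, with markers grouped by sex           */
proc sgscatter data=sashelp.heart ;
compare y=ageatdeath x=(ageatstart agechddiag mrw) / group=sex ;
run ;
```

The panels share the Y axis, which makes it easier to compare the strength of each relationship at a glance.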
Statistical graphics from 10,000 feet. Is this a full service blog or what?
My life is upside down. I spent all day, as my job, writing a program to get a little man to run around a maze, come out the other end and have a new screen come up with a math challenge question. Then, in the evening, I’m surfing the web for interesting bits to read on multivariate statistics.
I’m teaching a course this winter and could not find the Goldilocks textbook, you know, the one that is just right. They either had no details – just pointing and clicking to get results – or the wrong kind of details. One book had pages of equations, then code for producing some output with very little explanation and no interpretation of the output.
I finally did select a textbook but it was a little less thorough than I would like in some places. I decided to supplement it with required and optional readings from other sources. Thus, the websurfing.
One book I came across that is a good introduction for the beginning of a course is Applied Multivariate Statistical Analysis, by Härdle and Simar. You can buy the latest version for $61 but really, the introduction from 2003 is still applicable. I was delighted to see someone else start with the same point as I do – descriptive statistics.
Whether you have a multivariate question or univariate one, you should still begin with understanding your data. I really liked the way they used plots to visualize multiple variables. I knocked a few of these out in a minute using SAS Studio.
symbol1 v= squarefilled ;
proc gplot data=sashelp.heart ;
plot ageatdeath*agechddiag = sex ;
plot ageatdeath*ageatstart = sex ;
plot ageatdeath*mrw = sex ;
run ;
quit ;
Title "Male " ;
proc gchart data=sashelp.heart ;
vbar mrw / midpoints = (70 to 270 by 10) ;
where sex = "Male" ;
run ;
Title "Female " ;
proc gchart data=sashelp.heart ;
vbar mrw / midpoints = (70 to 270 by 10);
where sex = "Female" ;
run ;
quit ;
If I had more time, I would show all of these plots in one page – a draftsman’s plot – but I’m running out to the airport in a minute. Maybe next time. Yes, I do realize these charts are probably terrible for people who are color-blind. Will work on that at some point also.
You can see that the age at diagnosis and age at death are linearly related. It seems there are many more males than females and the age at death seems higher for females.
The picture with Metropolitan relative weight did not seem nearly as linear, which makes sense because if you think about it, age at start and age at death HAVE to be related. You cannot be diagnosed at age 50 if you died when you were 30. It also seems as if there is more variance for women than men, and the distribution is positively skewed for women.
The last two graphs seem to bear that out (you can see those here – click and scroll down), which makes me want to do a PROC UNIVARIATE and a t-test. It also makes me wonder if it’s possible that weight could be related to age at death for men but not for women. Or, it could just be that as you get older, you put on weight and older people are more likely to die.
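For what it’s worth, here is a minimal sketch of that follow-up. It assumes mrw is the variable of interest and sex is the grouping variable; nothing here has been tuned, it is just the default output.

```sas
/* Distribution of Metropolitan relative weight, separately by sex: */
/* moments, skewness, quantiles and a histogram for each group      */
proc univariate data=sashelp.heart ;
class sex ;
var mrw ;
histogram mrw ;
run ;

/* Two-sample t-test comparing mean mrw for males and females */
proc ttest data=sashelp.heart ;
class sex ;
var mrw ;
run ;
```

PROC TTEST prints both the pooled and the Satterthwaite results along with a test for equality of variances, which matters here if the variance really is larger for women.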
My point is that some simple graphs can allow you to look at your variables in 3 dimensions and then compare the relationships of multiple 3-dimensional graphs.
Gotta run – taking a group of middle school students to a judo tournament in Kansas City. Life is never boring.
I’m pretty certain I did not deliberately hide these folders. When I opened up my new and improved SAS Studio, it had tasks but my programs were missing.
If this happens to you and you are full of sadness missing your programs, look to the top right of your screen where you see some horizontal lines. Click on those lines.
A menu will drop down that has the VIEW option. Select that and select FOLDERS. Now you can view your folders where your programs reside. Based on this, you might think I’m against using the tasks. You’d be wrong. I just like having the option to write SAS code if needed. The tasks are super easy to use and students are going to love this. Check my multiple regression for an example.
I selected the data set from the SASHELP library, the dependent and independent variables and there you are – ANOVA table, parameter estimates, plot of dependent observed vs predicted and if you scrolled down – not here because this is just a screen shot, but in SAS Studio, you’d see all of your diagnostic plots. Awesome sauce.
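If you would rather write the code than use the task, something along these lines should give roughly the same output. This is a sketch, not the code the task generated; the dependent and independent variables below are my own hypothetical choices from sashelp.heart, not necessarily the ones in the screen shot.

```sas
/* Sketch of a multiple regression with the standard diagnostics panel. */
/* The variables in the MODEL statement are illustrative assumptions.   */
ods graphics on ;
proc reg data=sashelp.heart plots=diagnostics ;
model ageatdeath = ageatstart mrw cholesterol ;
run ;
quit ;
```

The default output includes the ANOVA table and parameter estimates, and the diagnostics panel includes the observed by predicted plot.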
Yes, I do realize that I’m probably far more excited about our new website coming on line than is normal. Several points here on a Friday night:
- I completely disagree with those entrepreneurs who say, “You sell the sizzle not the steak” when what they mean is that they really don’t have a good product but just a good story a.k.a. a line of bullshit.
- I think we have benefited from never hiring anyone in our company who has experience as a middle manager.
- You’re better off having a great product and a lousy website than the other way around.
- Not having too much money can be a benefit when starting a business.
Back in the paleolithic era when I was in undergraduate marketing classes, they drilled into us the four P’s – product, price, promotion and place. There were lots of things I learned in business school that I disagreed with, but one I have found to be true to this day is that the most important of those four P’s is product. If your product is terrible, you may get people to buy it once if it’s cheap enough, they live close enough or you advertise it enough, but they aren’t going to buy it again.
Since we began 7 Generation Games, our priority has been making math awesome. Our first game had a lot of problems, many of them due to incompatibilities with web browsers or being stopped by school district firewalls. Ever call technical support and the person on the other end of the line says to you,
“Well, it works on my computer.”
Yeah, it was like that. So, we have been working like crazy to add every feature, correct every bug reported by our infinitely patient and wonderful alpha and beta testers (we love you guys). We still have, literally, hundreds of improvements we want to make, and I expect we always will. I work on them every day. Spirit Lake: The Game works. It doesn’t crash, it has lots of math and kids like to play it. Fish Lake is in process. Making a good game was our highest priority and still is. We just hired another developer (yay!) to help us out, are ramping up the artwork for the next two games, hired people as testers, an audio engineer …
Now that we have more people working in our company we have started to implement some actual policies and procedures. We have a git repository, a source management system, an issue tracking system and a file sharing system. We signed up for Amazon Web Services, Google Apps for Work, basecamp, some payroll system Donna manages – a lot of stuff I thought would be useless for us at the beginning. This is why I am glad we never hired anyone who had been a middle manager – because I was right. That stuff would have been useless for us at the beginning. It would have wasted our time and kept us from doing the most important work of making a good product. When do you add that layer of management? When you find yourself swearing,
“Damn it, we NEED a way to make sure you’re not copying over the changes I just made!”
When you only have two people working, and both in the same house, one can holler upstairs to the other,
“Hey, I’m working on level 4 today, okay? So, don’t touch it.”
At that point, you don’t need version control. Now, we do. When we did hire a project manager, we hired someone who had run a small business for ten years who shared our idea of having the degree of management you absolutely need and no more.
Finally, finally, finally, we are updating the 7 Generation Games website which, I believe, Maria originally put together in four hours one afternoon. It isn’t as if we didn’t know it needed a huge improvement. We believed our less than infinite time was best spent improving the game, meeting with customers, getting their feedback, designing more levels. We’re a small company. At Unite 2014, I attended a session where a developer mentioned they had 50 people working on their game for 2 1/2 years and it still wasn’t finished – that’s 125 person-years! That’s just people making the game – not managers, marketing, accounting. We’ve spent something more than 2.5 person-years developing ours, which explains why we constantly feel like we need to put every spare second into development.
Having the luxury to worry about the website says something about how we have matured as a company. With new people hired to take the non-development work off of us and additional people picking up some of the development work, we no longer can say,
“Having a spiffy new website is the least of our problems.”
In fact, it’s been bugging the hell out of me for a few months now. Did I feel bad about it? Yes. Like the source management system, when it got to the point where it felt like,
“Damn it, we need this!”
instead of
“Brother, I got 99 problems and that ain’t one of ’em”
it was time to get it done.
I’ve had people tell me that we should have been working on our website with bells, whistles and gold tassels before now because “VCs won’t be impressed if you don’t have a professional website.”
Hmm. Not sure VCs will be impressed if you don’t have a product, either. I know companies that started about the same time as 7 Generation Games and had terrific websites, brochures, every social media account you can imagine, unbelievably honed pitches – and they evaporated because they were all sizzle and no steak.
I’ve written before about Paul Hawken’s recommendation that, in growing a business, you do as much for yourself as possible. That’s a whole post in itself, but to cut to the point – you keep your overhead low, which means you don’t require external funding in the short run. You are more viable in the long run not just because you have low debt and low operating expenses, but because you also have the asset of everything you have learned yourself.
But we still hired someone else to update the website (-:
I’ve been pretty pleased with SAS Studio (the product formerly known as SAS Web Editor), so when Jodi sent me an email with information about using a virtual machine for the multivariate statistics course, I was a bit skeptical. Every time I’ve had to use a remote desktop connection virtual machine for SAS it has been painfully slow. I’ve done it several times but it’s probably been like in 2001, 2003 and 2008 when I was at sites that tried, and generally failed, to use SAS on virtual machines.
Your mileage may vary and here is the danger of testing on a development machine – I have the second-best computer in the office. I have 16GB of RAM and a 3.5 GHz Intel Core i7 processor. Everything from available space (175 GB) to download speed (27Mbps) is probably better than the average student will have.
On the previous occasions when I was using SAS on a remote virtual machine, I had pretty good computers, too, for the time, but 6 to 13 years makes for pretty dramatic differences in terms of technology.
That being said, the virtual machine offered levels of coolness not available with SAS Studio.
Firstly, size. I did a factor analysis with 206 variables and 51,000 observations because I’m weird like that. I wanted to see what would happen. It extracted 49 factors and performed a varimax rotation in 16.49 seconds. I don’t believe SAS Studio was created with this size of data set in mind.
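I can’t post that data set, but the general shape of the code is nothing special. Here is a sketch on sashelp.heart instead; the variable list and the choice of two factors are assumptions for illustration only, not what I ran.

```sas
/* Sketch: principal factor extraction followed by a varimax rotation. */
/* Variables and NFACTORS= are illustrative, not from the real run.    */
proc factor data=sashelp.heart method=principal nfactors=2 rotate=varimax ;
var height weight diastolic systolic mrw cholesterol ;
run ;
```

Swap in your own data set and variable list; with a couple hundred variables you would normally leave NFACTORS= off at first and let the default retention criterion decide how many to keep.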
Secondly, size again. The data sets on the virtual machine added up to several times more than the allowable space for a course directory in SAS on-demand.
Thirdly, it looked exactly like SAS because it was.
Now, I do realize that the virtual machine with SAS is probably only allowable if your university has a site wide license from SAS.
SAS Studio retains the significant advantage of being free and easy. It also seems to have morphed overnight. I don’t remember these tasks being on the left side, and while they look interesting and useful, they do NOT:
- Encompass all of the statistics students need to compute in my classes, e.g. , population attributable risk.
- Explain where the heck my programs went that I wrote previously. I can still create a new program and save a program and it even shows the folders I had previously as choices to save the new program.
The first is easily taken care of if I can just find out where the programs are saved. For statistics not available in the task selections, students can just write a program. I’ll look into that this weekend since I have had to get up THREE days this week before 9 a.m. I am thinking I need to get some sleep.
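As an example of the kind of short program I mean, here is a sketch of population attributable risk computed in a DATA step with Levin’s formula. The prevalence of exposure and the relative risk are made-up numbers, and the data set and variable names are hypothetical.

```sas
/* Sketch: population attributable fraction by Levin's formula,    */
/*   PAF = Pe(RR - 1) / (1 + Pe(RR - 1))                           */
/* pe and rr below are invented values for illustration only.      */
data par ;
pe = 0.30 ;   /* proportion of the population exposed (assumed)    */
rr = 2.0 ;    /* relative risk, exposed vs. unexposed (assumed)    */
paf = pe * (rr - 1) / (1 + pe * (rr - 1)) ;
put paf= ;    /* writes the result to the log */
run ;
```

With these inputs the fraction comes out to about 0.23. In class, students would get the relative risk from their own data first, for example from the RELRISK option in PROC FREQ.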
From my initial takes of the latest versions of each, I think I will:
- Use SAS Studio for my biostatistics course because it is an easy, basic introduction AND, once I figure out where the programs are hidden, I can have students write some simple programs. (It may be in an obvious place but sleep deprivation does strange things to your brain.)
- Use the virtual machine for multivariate statistics because it allows for larger data sets and, although I did not have a similar size data set in SAS Studio, I am assuming it will run much faster.
It’s still technically the weekend so I’m not blogging about statistics until tomorrow. After some debate, I do think I have a multivariate stat textbook selected, though, so that’s good.
I got an invitation to go to a luncheon an old friend of mine sets up every few months for a bunch of us that used to work out together back in the 1980s. Almost all of those who attend are retired and three of the guys (they are all guys except for me) passed away in the past year or so. We’re getting to “that age”.
Ten or fifteen years ago, it was everyone’s parents that were dying. I competed in judo for 14 years, and taught it for another 30, so I have a lot of friends and acquaintances who are Japanese. At Japanese Buddhist funerals, at least here in Los Angeles, there is a song they always sing in Japanese. One day, after a half-dozen or more of our friends’ parents had passed away in a pretty short period of time, my friend, Hayward, leaned over to me during the service and said,
You know, I’m getting a little worried. I’m starting to know all of the words to that song.
Now, though, it is our friends and acquaintances who are dying, not their parents. That’s the sort of thing that gives you pause. Every time I go to one of those luncheons, we are talking about the people who we miss who aren’t there any more.
I’ve been working a lot of hours – as always. The Spoiled One talked me into going out with her twice this weekend, once to have dinner at the end of a pier in Malibu at sunset, and once to go hiking in the Santa Monica mountains. She can be a good influence sometimes. She goes to boarding school during the week and she asked,
I’m only home on the weekends. You have all week to work. Why do you have to work while I’m here?
I didn’t have a really good answer to that.
The Invisible Developer said to me,
You know, I think we’re getting to that age where I don’t think we should have to do much other than what we want to do.
Ignoring the fact for a moment that a) he may be correct and b) that does reflect that we have a privileged life that most people in the world don’t attain, I spent Sunday until midnight writing the budget justification for a grant which was decidedly something I did NOT want to be doing. I have made an adjustment now that at midnight, I try to quit working, no matter what.
I made a major decision to write, at most, one more grant. If we get the Phase I that I’m writing now, I’ll write the Phase II. Other than that, I’m done. Over it. Seriously, after you’ve brought in $30 million or so, is getting $30,100,000 going to make a difference in your overall accomplishments? That’s why I quit keeping track of grants funded after the first $30 million and just put down the latest ten or so. No, I don’t get to keep that money, either. It gets paid out over the years in salaries, rent, supplies, student scholarships and all of the other stuff the grants were written to do.
Once this grant is done, I will be working on the games and doing nothing else in September and October. Then, there are a few months of teaching classes and another six months after that of just working on the games.
I’m saying, “No” a lot.
- No, I am not interested in another consulting contract.
- No, I don’t want to work on a journal article with you.
- No, I’m not writing another grant.
- No, I’m not teaching any more classes.
- No, I will not present at your conference.
I’m throwing all of my eggs in one basket, working on making our games better and better. We’re taking a risk, focusing just on this and hiring more people to work on the games to boot. There is actually a lot of statistics in it, too, both analyzing the data we’re in the middle of collecting and in our next game, under design, which teaches statistics.
Maybe someone else would retire and lie on a beach, but I’ve tried that a few times and I’m a complete failure at it.
What else would I do? This is what I do.
It seems like a good answer. As for me, what I do is game design, coding, statistics. I’m just going to do that. It occurs to me that I have just written either the last or next-to-last grant budget I am ever going to write. And that makes me very, very happy.
The new common core standards have statistics first taught in the sixth grade, or so they say. I disagree with this statement because as I see it, much of the basis of statistics is taught in the earlier grades, although not called by that name. Here are just a few examples:
- Bar graphs
- Line plots
- X,Y coordinates
- Fractions and decimals (since the mean is rarely going to be an integer)
- Ratios and proportions – in summarizing a data set, it’s pretty common to point out, for example, that the ratio of game fish to non-game fish was 3:2. We are often asking if the percentage of something observed is disproportionate to the percentage in the population.
It doesn’t bother me that these topics are not called statistics, I’m just pointing it out. Whether a line is considered a regression line or simply points in two-dimensional space is a matter of context and nothing else.
Speaking of lines and graphs, the very basis of describing a distribution starts with graphing it. So, those second grade bar graphs? Precursors to sixth grade requirements to summarize and describe a set of data.
You might say I’m going to an extreme including fractions in there because I may as well throw in addition and division. After all, you need to add up the individual scores and divide by N. Actually, I wouldn’t argue too much with that view.
You can’t even compute a standard deviation without understanding the concepts of squares and square roots, so it would be easy to argue that is at least a prerequisite to statistics.
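To make that concrete, here is the usual sample standard deviation written out, with the mean it depends on. The notation (x for the scores, N for the number of scores) is mine.

```latex
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
s = \sqrt{\frac{\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^{2}}{N - 1}}
```

Addition, division, subtraction, squaring and a square root, all in one formula, and every one of them is taught before anything gets called statistics.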
While I’ve heard a lot of people hating on the common core, personally, I’m interested in seeing how it plays out.
What I expect will continue to happen is that many children will be turned off of math by the third grade because it is generally taught SO abysmally. That isn’t all the fault of teachers – the books they are given to use are often deathly boring. This isn’t to say I am not bothered by the situation. It bothers me a lot.
Working mostly in low-performing schools, I see students who are not very proficient with fractions, proportions, exponents or mathematical notation. We are trying to design our games to teach all of those prerequisites and then start showing students different distributions, having them collect and interpret data.
Lacking prerequisites is one of the three biggest barriers I see in teaching statistics, or any math, to students. The other two are related: low expectations for what students should be able to learn at each grade, and the fiction on the part of teachers and students that everything should be easy.
People were all up in arms years ago because there was a Barbie doll that said, “Math is hard.”
Guess what? Math is hard sometimes and that is why you have to work hard at it. Even if you really like math and do well at it in school, even if it’s your profession, there are times when you have to spend hours studying it and figuring something out.
Today, I was reviewing textbooks for a course I’ll be teaching on multivariate statistics. I didn’t like any of the three I read for the course, although I found one of them pretty interesting just from a personal perspective. That one had page after page of equations in matrix algebra and it would be a definite stretch for most master’s students. I’m really debating using it because I know, just like with the middle school students, there will be many lacking prerequisites and it will take a LOT of work on my part to explain vectors and determinants before we can even get to what they are supposed to be learning.
Last week, I had someone seriously ask me if we could make our games “look less like math so that students are learning it without realizing it”. No, we cannot. There’s nothing so wrong with learning math that you need to disguise it to look like something else.
Whenever I catch myself thinking in designing a game, “Will the students be able to do X?” and I think they will not because they are lacking the prerequisites, I build an earlier level to teach the prerequisites and go ahead and include X anyway.
Here is why — I’m sitting at the other end teaching graduate students where the text begins like this:
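root mean square residual (RMR) For a single-group analysis, the RMR is the root of the mean squared residuals:
$$\mathrm{RMR} = \sqrt{\frac{1}{b}\left[\sum_{i=1}^{p}\sum_{j=1}^{i}\left(s_{ij}-\hat{\sigma}_{ij}\right)^{2} + \delta\sum_{i=1}^{p}\left(\bar{x}_{i}-\hat{\mu}_{i}\right)^{2}\right]}$$
where $b = p(p+1+2\delta)/2$ is the number of distinct elements in the covariance matrix and in the mean vector (if modeled).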
For multiple-group analysis, PROC CALIS uses the following formula for the overall RMR:
“The probability distribution (density) of a vector y denoted by f(y) is the same as the joint probability distribution of y1 … yp.”
“It is easy to verify that the correlation coefficient matrix, R, is a symmetric positive definite matrix in which all of the diagonal elements are unity.”
I’m working on a section of a game that teaches fractions. If a player misses the question about where to meet up with the returning hunter, he or she gets sent to study. There is a movie that plays before this about needing to get back to the camp before dark.
Here is the question,
“The sisters begin to worry their brothers won’t make it back by dark. They start down the trail to meet them. They decide to stop and wait at the spot where their brothers will be 3/4 of the way back to camp. How far FROM the camp will the girls be?”
I used this question because I want students to think about a few ideas:
- Distance between two points can be thought of as a whole.
- If you are a/b distance FROM point X, the remaining distance TO point X is 1 – a/b . Of course, I don’t expect them to state it like that.
- 1/4= 2/8
- Number lines can be numbered in either direction. You can have 0 on the left or 0 on the right. The distance will be the same. The size of each interval will be the same.
These are kind of important ideas in math – equivalence, the arbitrary nature of labeling points on a line.
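Written out for the question above, taking the whole trail from the hunting grounds to the camp as 1, the arithmetic I want students to end up with is:

```latex
\text{distance from camp} \;=\; 1 - \frac{3}{4} \;=\; \frac{4}{4} - \frac{3}{4} \;=\; \frac{1}{4} \;=\; \frac{2}{8}
```

The last step is there because the number line in the game may be marked in eighths rather than fourths (that detail is my assumption here), so the player has to recognize 2/8 as the same spot.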
Students can click on GIVE ME A HINT, and a hints page pops up that explains, among other things, why you were wrong if you answered that the sisters would be 3/4 of the distance to the hunting grounds FROM the camp. If, even after reading the hints (or after skipping the hints and just guessing; we’re talking kids, after all), players get the problem wrong, they are sent to watch a video clip explaining the problem, and then have to take a quiz to get back to the game.
SO … I had the thought that, instead of writing the quiz questions out of thin air, I might read what some more experienced teachers were giving to students in this grade as math problems. After all, I haven’t taught middle school math since the 1980s. I went to several sites, I even purchased some things like “One year of fifth-grade homework problems” etc.
When I looked at page after page of what students are being given as homework assignments, the only thing I could think was “Are you fucking kidding me? No wonder kids hate math.”
All of the homework was like this:
1/4 + 1/3 = ?
For FIFTY problems. That’s it! Then, the next day, it would be another fifty problems like this:
5/6 – 1/4 = ?
Okay, you need to learn to add and subtract fractions, but is that ALL you need to learn? Obviously not. How boring must it be to sit and just calculate answers to the same type of problem over and over? This stuff made me start to hate math and I LOVE math.
How can you possibly think that is teaching kids math? That’s like making them copy down all of the words in the dictionary and pretending you taught them literature.
Don’t even get me started on teaching statistics – wait, too late. I’m started. That is my rant for tomorrow.
Sometimes the benefits of attending a conference aren’t so much the specific sessions you attend as the ideas they spark. One example was at the Western Users of SAS Software conference last week. I was sitting in a session on PROC PHREG and the presenter was talking about analyzing the covariance matrix when it hit me —
Earlier in the week, Rebecca Ottesen (from Cal Poly) and I had been discussing the limitations of directory size with SAS Studio. You can only have 2 GB of data in a course directory. Well, that’s not very big data, now, is it?
It’s a very reasonable limit for SAS to impose. They can’t go around hosting terabytes of data for each course.
If you, the professor, have a regular SAS license, which many professors do, you can create a covariance matrix for your students to analyze. Even if you include 500 variables, that’s going to be a pretty tiny dataset but it has the data you would need for a lot of analyses – factor analysis, structural equation models, regression.
Creating a covariance data set is a piece of cake. Just do this:
proc corr data=sashelp.heart cov outp=mydata.test2 ;
var ageatdeath ageatstart ageCHDdiag ;
run ;
The COV option requests the covariances and the OUTP option has those written to a SAS data set.
If you don’t have access to a high performance computer and have to run the analysis on your desktop, you are going to be somewhat limited, but far less than just using SAS Studio.
So — create a covariance matrix and have them analyze that. Pretty obvious and I don’t know why I haven’t been doing it all along.
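On the student side, the analysis can then read the matrix instead of the raw data. A minimal sketch with PROC REG, assuming the mydata.test2 data set created above is in a library the students can see:

```sas
/* Sketch: regression run from the covariance/correlation data set  */
/* written by PROC CORR, rather than from the raw observations.     */
proc reg data=mydata.test2 ;
model ageatdeath = ageatstart agechddiag ;
run ;
quit ;
```

Two caveats. Anything that needs the individual observations, such as residual plots, is not available from a matrix. And PROC CORR uses all available pairs by default, so I would probably add the NOMISS option when creating the matrix so that every entry is based on the same set of records.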
What about means, frequencies and chi-square and all that, though?
Well, really, the output from a PROC FREQ can condense your data down dramatically. Say I have 10,000,000 people and I want age at death, blood pressure status, cholesterol status, cause of death and smoking status. I can create an output data set like this. (Not that the heart data set has 10,000,000 records but you get the idea.)
Proc freq data= sashelp.heart ;
tables AgeAtDeath*BP_Status*Chol_Status*DeathCause*Smoking / noprint out=mydata.test1 ;
run ;
This creates a data set with a count variable, which you can use in your WEIGHT statement in just about any procedure, like
proc means data = mydata.test1 ;
weight count ;
var ageatdeath ;
run ;
Really, you can create “cubes” and analyze your big data on SAS Studio that way.
Yeah, obvious, I know, but I hadn’t been doing it with my students.
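One refinement worth considering: since the count variable is literally the number of people in each cell, the FREQ statement is arguably a closer fit than WEIGHT. A sketch, assuming the summarized data set from PROC FREQ is saved as mydata.test1:

```sas
/* Sketch: treat each row as COUNT identical observations, so N and */
/* the standard deviation reflect the original number of people.    */
proc means data=mydata.test1 n mean std ;
freq count ;
var ageatdeath ;
run ;
```

The mean is the same either way. With WEIGHT, the N reported is the number of rows in the summarized data set; with FREQ, it is the number of people those rows represent.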
On my way home from the 2014 Western Users of SAS Software conference. When I was younger, I would go to every basic session trying to find something I could use that wasn’t over my head. As I got older, I went to the statistics sessions to see if there was anything new or more advanced I had not mastered yet.
Now that I’m really old, I just do my own presentations and then spend the rest of the conference wandering around to anything that looks interesting. Sometimes, the most interesting stuff is the questions after a session or just the random people I run into in the hallways.
Interesting stuff: Part 1 Data coolness
I had used the California Health Interview Survey as example data for classes but I was not aware of the huge breadth of data available there. Also, if you are a researcher and ask them nicely they will create data sets for you, as long as the data are available and it can be done without violating confidentiality requirements. Check them out here.
Say you wanted to chart the number of amputations per 100,000 workers over the past six years. The state of California has you covered.
That was pretty random, yes? Want Pneumoconiosis hospitalizations? Just check it out if you ever need health data, death data, politics – anyway, good resource.
Interesting stuff: Parts 2, 3 & 4, which I hope to write about this week
Another random idea that I have certainly had before but never implemented … eek, I have to go check out, but remind me: it has to do with getting around the SAS Studio limit on ginormous data.
Also, F-test, p-value and r-square
And permutations, random data, bootstrapping and creating your own version of F-tests, t-values and p-values