I finally am getting around to something in SAS Studio that I think is really cool – the tasks.
Although they don’t look identical to SAS Enterprise Guide just because the screen layout is a little different, these are really, really similar to the tasks you would see in EG.
If you are using this with a real beginner class, you can start out using the sashelp library. Otherwise, the only programming they will need is to assign their directory at the beginning. Run that program and away you go.
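If your class does use its own data, that one bit of programming is a LIBNAME statement. Here is a minimal sketch; the folder path and the library name mydata are placeholders I made up, so swap in whatever directory your course or account actually uses.

```sas
/* Assumption: the path and the libref "mydata" are placeholders, */
/* not a real course directory. A libref can be up to 8 characters. */
libname mydata "/home/your_userid/mydata";

/* Optional check: list the data sets SAS can see in that library */
proc contents data=mydata._all_ nods;
run;
```

Once the program has run, the library should show up under Libraries and its data sets can be picked in any task.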
Let’s take an example with SASHELP.
1. Log into your SAS on-demand account and select SAS Studio
2. Click on tasks in the left pane.
3. Select statistics
4. Select distribution analysis
Note that Studio will show you the type of variables required for your analysis.
It will also tell you in the right window pane whether you are unable to run your analysis and, if so, what you are missing.
To select a variable, just click the + sign and you’ll see a list of the variables in your data set.
Select the desired variables for the roles and click your little running guy at the top. (This will be greyed out until you have all of the required fields filled in.)
Your results will appear in the right window pane.
If you want to see the code, click on the code button.
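proc univariate data=SASHELP.BWEIGHT noprint ;  /* noprint suppresses the tables of statistics */
class married ;      /* one histogram for each value of married */
histogram Weight ;   /* distribution of birth weight */
run;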
To keep your program, copy the code
- Click on folders
- Click on the far left button and select new program
- Paste in your code
- Save your program
You could do a whole bunch of analyses this way and come up with a lot of results and a long SAS program without knowing anything about programming SAS. Some people would find that horrifying but I think it’s really cool.
A really interesting assignment for students would be to take a data set in the sashelp library, do as many tasks as they can and see what interesting results and conclusions they can find.
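If students need help picking a data set for that assignment, a short query against the SAS dictionary tables will list what is in sashelp. This is only a sketch of one way to browse the library; clicking through Libraries in the left pane works just as well.

```sas
/* List every data set in SASHELP with its number of rows and columns */
proc sql;
  select memname, nobs, nvar
    from dictionary.tables
    where libname = 'SASHELP' and memtype = 'DATA'
    order by memname;
quit;

/* Then look at the variables in the one you picked, for example BWEIGHT */
proc contents data=sashelp.bweight;
run;
```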
Are you kidding me?
If you are a programmer, analyst, statistician, professor or student who uses SAS this is an opportunity to get to know your people and to get known.
I’m in Dallas for the SAS Global Forum, which I try to attend whenever I can. Yes, I could watch videos on the Internet, read books, read web pages, but I often don’t because I have a to-do list a mile long.
By presenting at the conference, I have to review what I am doing in teaching with SAS Studio and why.
SHAMELESS PLUG: My session on Preparing Students for the Real World with SAS Studio is a good one both for anyone who teaches with SAS and for anyone who is new to the SAS world and wants a good introductory session.
Since I am at the conference, I have a little bit of downtime to look into SAS resources. My new favorite is SAS communities. It’s a combination forum and free library. I must have looked into it at some point, because I had an account, but it seems to be more active now. I even submitted an article and poked around in the forum.
Then, of course, there are all of the sessions that I will attend, conversations I will have with people, books I will hear about and buy, to read on the plane ride home.
It’s a week of learning.
But, but, you stutter like a motor boat, it’s expensive and far away. I can’t afford it. Besides, I would feel uncomfortable presenting at the same conference with all of those people who wrote the books on SAS (literally).
The expensive part I get. The not feeling like you could present at the same conference part is just silly, so I’m going to pretend you didn’t say that.
If travel and cost are an issue, present at your local conference. The call for papers for the Western Users of SAS Software (WUSS) is open. Do it now!
It is painless. You submit a 300-word abstract. You can submit a working draft of the paper at the same time. That’s not mandatory but it improves your chances.
There is even a mentoring program where old people (like me) will help you revise your program and get ready to present.
Writing and presenting the paper will force you to think about what you are doing and why. You will likely make some contacts of people who will be potential employers, collaborators or drinking buddies.
What are you waiting for? A personal invitation?
Fine! Here you go.
Need a topic? Here are 10 I would like to see
- The 25 functions I use most.
- Uses of PROC FORMAT.
- Multinomial logistic regression.
- The many facets of PROC FREQ.
- Factor analysis
- SAS for basic biostatistics
- Macro for data cleaning
- Model selection procedures
- Mixed models vs PROC GLM
- SAS Graphs without SAS/Graph (because SAS/Graph appears to be written in Klingon)
My point is that if I sat here and thought of 10 off the top of my head after two glasses of Chardonnay and half a glass of the champagne someone who will remain nameless bought at Costco and brought here from a state in the WUSS region, then I’ll bet you could come up with something really awesome stone-cold sober and given more than 60 seconds.
Let’s recap what we have learned here, shall we?
- Join SAS communities,
- Attend conferences, whether national or global,
- Don’t be a wallflower – present!
- Texas steak and wine is a good combination (not particularly related to SAS but true nonetheless)
As you have no doubt brilliantly deduced from the title, this is installment number four in Mama AnnMaria’s Guide on How Not to Get Your Sorry Ass Fired.
Do the work you are assigned to do.
This may seem like anyone with a brain functioning above zombie level could figure this out. However, I have seen many people who do not appear to be zombies who can’t quite master this concept.
Variation 1: Doing work other than what you are assigned because you are more confident doing something else.
If you are supposed to be writing a new database program but instead you update the wiki because you don’t consider yourself an expert on SQL and PHP then you are on the road to getting your sorry ass fired. Unless you lied on your resume (in which case you deserve to be fired), I’m willing to bet that your boss knows you are not an expert and either:
- Assigned you something that requires below expert level of expertise, or
- Believes you can learn enough to accomplish the task.
In either case, do the work you are assigned to do.
Variation 2: Doing work you were assigned to do later instead of the work you were assigned to do now.
Again, this is often because you feel more confident in your ability to plan the office Christmas party than in your ability to update the operating system. However, it is July and your company is still using Windows XP.
Get on it.
Variation 3: Finding an excuse why you cannot do the work.
This is the quickest path for getting your sorry ass fired. Let me explain how it looks to me:
I have given you work to do because I don’t have enough time to do it all myself. Instead of doing the work, you come back to me with more work, making it my problem instead of your problem. Do you really think I hired you because I was thinking to myself,
“Gee, I don’t have enough work to do. What I really need here is someone who will bring me more work.”
Some people might ask,
“But don’t you want me to tell you if there is a problem?”
Let me explain this to you in short words.
I want you to do the work I assigned you to do. That is why I assigned it to you instead of doing it myself.
This doesn’t mean that, for example, if you don’t know the password to log in to the server you shouldn’t ask someone. It does mean that most of the time, you should solve the problem yourself if possible.
The worst example of this I ever met was a woman who never could do any work because there was always some obstacle in her way. The final straw came the day she told her boss that she had not created and printed the company newsletter because they were out of paper.
Guess whose job it was to order the paper? When he pointed this out to her, she argued that her job description said to order office supplies but no one had specifically told her that the office was out of paper.
She got her sorry ass fired.
It has been pretty well established that I am the worst soccer mom in the history of soccer moms. Most of the games I miss because I am somewhere else. My children have told me that my autobiography should be entitled, “I was out of town at the time” because most of the stories of their childhood begin this way.
Having come back in town shortly before the game this weekend, I was unaware that it was a two-day tournament 2 1/2 hours from home and that we were supposed to have reserved the hotel weeks ago. Hot tip: If you get your reservation last minute and have the choice of a close hotel or a nice hotel, get the nice one.
I fulfilled my obligation. I showed up. During the time The Spoiled One played, I watched. During half time and the breaks between the games I was able to write a couple of blog posts and test out SAS Studio.
If you look at the picture above you might see that I was working in a field surrounded by mountains. Not the best situation for Internet access, which I had via the hotspot on my iPhone.
I was able to log on to SAS Studio with no problem. When I logged in on my iPad I had the screen shown above where I could just start typing my program in the code window.
To see folders, libraries, etc. tap the BROWSE link in the top left corner, as shown
You can tap any of the categories to bring down the list of folders, libraries, etc. You can tap on a file to open it.
The one problem I did have, and depending on your situation, it may be a severe one, was that I could not get any of the libraries to open. I wanted to open the sashelp library and see if I could run some tasks using an open data set. This did not work. It is very possibly related to poor Internet due to lying in a soccer field ringed by mountains. I tried it last year in a movie theater and I was able to access the libraries. In this case, as you might guess from the top photo, the Internet was barely accessible.
Next, I tried simulating a homework problem a student might have, just typing in some data and running the program.
I have a bluetooth keyboard I use with my iPad and it all worked fine. I typed in data, tapped on the little running guy and my program ran fine. You can see the results below.
To save it, I held down the home button and the power button simultaneously, just like any time you take a screenshot on an iPad. Then, I emailed that screenshot to myself, so here you have results.
My point is that a student could do their homework using SAS Studio in the middle of a soccer field on an iPad, as long as it did not require external files, which most of the homework I assign does not. They could then email the results to their professor, still from the (dis)comfort of the field.
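To give an idea of what I mean by homework that needs no external files, here is a sketch of the kind of program a student could type straight into the code window. The scores are made up for illustration; they are not from any of my classes.

```sas
/* Hypothetical homework data: quiz scores for two class sections */
data homework;
  input section $ score;
  datalines;
A 82
A 75
A 91
B 68
B 88
B 79
;
run;

/* Descriptive statistics for each section */
proc means data=homework n mean std min max;
  class section;
  var score;
run;
```

Everything the program needs is in the DATALINES block, so there is nothing to upload and no library to open.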
This is useful to know for three reasons:
- I travel frequently to areas where there is very limited bandwidth,
- Many of the students in my online courses live in areas with limited bandwidth,
- The Spoiled One’s team won their bracket in the State Cup, so it turns out that means they have more soccer games next weekend as they advanced in the tournament. This is not at the same field surrounded by mountains. It’s at a different field at the edge of the desert. Sigh.
Your students should be able to use SAS Studio almost anywhere, even if all they have is an iPad.
This is doubly true if you don’t assign homework that requires accessing external datasets.
I’ll be able to review homework assignments for the course I am teaching next during the soccer tournament this weekend. (I really AM the worst soccer mom in the history of ever.)
—————— SHAMELESS PLUG
Our Kickstarter campaign is still going on, making adventure games to make math (and history and English) awesome.
We are 84% of the way to our goal!
Is that enough acronyms for you? I’ll be speaking at Celebrating Equity: Women in STEM at ELAC.
STEM = Science, Technology, Engineering and Mathematics
ELAC = East Los Angeles College
It’s 12:15 – 1:15 pm and it is free. There are six panelists (including me).
I presented last year and our company also had a booth. We hired two people who I met there.
Often, I hear people say that their company is all white / Asian / under 40 because all of the developers / animators / audio engineers that applied just happened to fit that demographic. Here is a thought – perhaps you could go to, say, a college that is predominantly Hispanic and just maybe Hispanic potential employees might meet you there.
Here is another thought – perhaps if you attended events targeted at women in STEM, you might meet some there.
Wonder what a game would look like if it was created by a design and development team that was predominantly women?
Getting ready to teach biostatistics in a few weeks and it seems to me that the real confusion in most cases is not the calculations, which can be fairly simple, but rather that there can be several ways of looking at the same question. Let’s take “risk” as an example.
What is the “risk” of diabetes?
You could answer this by prevalence – 9.3% of Americans have diabetes. So you could say you have about a 1 in 11 chance of having diabetes. Is that your risk?
On the other hand, incidence, the number of new cases per year, is about 1.8 million, which comes out to around 0.6% in a population of 313 million. So, your chance of being newly diagnosed with diabetes in a given year is roughly 1 in 170. Is that your risk?
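If you want students to see where those two answers come from, the arithmetic fits in a few lines. This is just a sketch using the figures quoted above; the variable names are mine.

```sas
data _null_;
  prevalence = 0.093;      /* 9.3% of Americans have diabetes */
  new_cases  = 1.8e6;      /* about 1.8 million new cases per year */
  population = 313e6;      /* population of 313 million */

  incidence   = new_cases / population;  /* new cases as a proportion */
  one_in_prev = 1 / prevalence;          /* "1 in N" for prevalence */
  one_in_inc  = 1 / incidence;           /* "1 in N" for incidence */

  /* Results are written to the log */
  put prevalence= percent8.1 one_in_prev= 6.1;
  put incidence=  percent8.2 one_in_inc=  6.1;
run;
```

Prevalence works out to about 1 in 11 and incidence to about 1 in 174, which is why the same word, risk, can give two very different numbers.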
In discussing risk of a disease, it may be useful to consider the specific population. For example, the CONDITIONAL risk of having diabetes given the condition that your ethnicity is Asian-American of Chinese descent is 4.4 %. (Conditional risk of a disease is defined here as the prevalence given a specific condition.)
Conditional risk given that you are Puerto Rican is 14.8%.
What is the relationship between diabetes and ethnicity?
This is another simple-sounding question that can be answered in multiple ways. First of all, what is your reference group?
Is it, say, Puerto Ricans compared to the total prevalence of 9.3% ? Is it Puerto Ricans compared to non-Hispanic whites? To all Hispanics? To Americans of Chinese descent? If the latter sounds silly, I’m not sure why it is any sillier than non-Hispanic whites, but perhaps someone can enlighten me.
Once you have a reference group, then what do you pick as the method of measuring relationship?
Risk difference is the absolute value of the difference in probabilities between two groups.
So, if the probability of someone who is Puerto Rican having diabetes is 14.8%, which it happens to be, and the prevalence of diabetes among Central and South Americans in the U.S. is 8.5% (which it also is), the risk difference is 14.8 – 8.5 = 6.3 percentage points.
The relative risk is the risk of one group divided by the risk of the other group. So, the relative risk is 14.8 / 8.5 = 1.74. Rounding it up, you could say that Puerto Ricans are twice as likely to have diabetes as Central or South Americans – which sounds considerably different from saying that the difference between the two groups’ risks is .063.
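Both measures come from the same two numbers, which a short DATA step makes plain. A minimal sketch using the prevalence figures above; the variable names are my own.

```sas
data _null_;
  risk_pr  = 0.148;   /* prevalence among Puerto Ricans */
  risk_csa = 0.085;   /* prevalence among Central and South Americans */

  risk_diff = abs(risk_pr - risk_csa);   /* risk difference */
  rel_risk  = risk_pr / risk_csa;        /* relative risk */

  /* Results are written to the log */
  put risk_diff= 6.3 rel_risk= 6.2;
run;
```

The log shows a risk difference of 0.063 and a relative risk of 1.74: same data, two very different-sounding headlines.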
Then there are odds ratios, which I have written about extensively, including here. Proportional attributable fraction, proportional attributable risk.
Well, I can go on for weeks – and will, once class starts.
How to make it all less confusing
Start with this question, “What do you want to know and why do you want to know that?”
If you want to know what the probable demand for insulin will be in the next year, you might care most about prevalence + incidence. If you are interested in predicting diabetes 10 years from now, you might be very interested in differing probabilities within ethnic groups, as some have a much faster rate of growth than others.
If you are interested in screening or prevention, you would be very interested in which groups have the highest incidence.
I’m thinking a fun and useful thing to do for both biostatistics and epidemiology would be to have students make a flowchart with questions like: If you want to know this, then do that.
That’s a couple more posts, at least.
Feel smarter? Want to be even smarter?
Any time I hear someone brag,
“I’ve never used X in my life,”
I automatically assume that whatever it is, they haven’t learned it very well. Just about everything I’ve learned has come in useful, and the better I learned it, the more useful it is.
Take statistics, for example. There is nowhere in my life that knowledge of statistics isn’t helpful. Darling Daughter Number 3 competes in mixed martial arts and I’m the worrying type.
Whenever her next fight is announced, the very first thing I do is check the fight odds. For the one coming up in Brazil, she is a 15-1 favorite. Knowing that makes my stress level go down a little. I’ll still drop by her gym a time or two during camp just to reassure myself that all is going well. As I said, I’m a worrier.
The latest thing I’m worrying about is our Kickstarter campaign, but here again, statistics cheer me up. Two years ago, we did a Kickstarter campaign with a goal of $20,000. I should have researched a bit better in advance because, even though Kickstarter touted the 44% success rate, that is an average (there’s that knowledge of statistics again). Things that were less likely to get funded were projects seeking over $10,000, game projects and projects not featured on Kickstarter. We fit all three. Pretty depressing. In fact, looking at the statistics after we had started our campaign last time, I found that less than 5% of campaigns raised over $20,000.
Well, we made it. You’d think we would have learned our lesson, but for a couple of reasons I’ll go into another day, we decided to do ANOTHER Kickstarter two years later. So, here we are today.
The bad news is that the success rate on Kickstarter has gone down. The overall success rate is now 39%. The semi-good news is that the success rate for games actually ticked up a bit – it was 33% two years ago and it is 34% now.
The really good news: success tends to be all or nothing – 79% of projects that raised 20% of their goal ended successfully funded. Of projects that raised 41% of their goal, 94% went on to be successfully funded. We’re at 42% and we still have two-thirds of our campaign to run, so I’m feeling somewhat less worried.
There is not nearly enough replication in scientific research. It’s unfortunate that funding agencies and academic journals always want to see a new twist – a different technique, a different population. Personally, I’m very interested in reading studies that say:
“I did the exact same study as Mary Lou Who and I found pretty much the same thing.”
One reason this is interesting is that it controls for the history effect. Maybe a specific event determined the outcome. A second reason I find replication interesting is that people are very quick to generate causal hypotheses to explain relationships after the fact. In a subsequent study, those hypotheses can be assessed. Do they still stand up?
Here is an example that comes up in my personal life a lot. People assume since Darling Daughter Number 3 is on TV and in the movies a lot that it helps my business.
Let’s take a look at the graph below:
This shows website statistics for The Julia Group site. Those lines are average daily visits to this site in months when my little pumpkin had UFC world title fights. I used average daily visits to control for the fact that some months have more days than others. Contrary to expectations, in the months when she had fights I had a stagnant or declining number of visitors. Hearing this, some of the same people who had suggested her career would have a positive effect on business, without blinking an eye, reversed themselves and said it must be because I was distracted and away from the office during those months.
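In case the average-daily-visits adjustment is not obvious, here is a sketch of how it could be computed. The monthly totals below are invented to make the point; they are NOT the real numbers for The Julia Group site.

```sas
/* Hypothetical monthly totals, not real website statistics */
data visits;
  input month :monyy7. total_visits;
  /* Day of the month for the last day of that month = number of days in it */
  days_in_month = day(intnx('month', month, 0, 'end'));
  avg_daily = total_visits / days_in_month;
  format month monyy7. avg_daily 8.1;
  datalines;
JAN2014 31000
FEB2014 28000
MAR2014 31000
;
run;

proc print data=visits noobs;
run;
```

In this made-up example February looks like a 10% drop in total visits, but the average per day is identical in all three months.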
Let’s replicate that graph with data from 2012-2013. You see a pretty similar trend between the top and bottom lines. Over the past couple of years, visits have been rising, so the average number of daily visitors is higher than in 2012-13 but the pattern is the same – an increase during the months from September to December and fewer visits in the summer months. December 2012 was a little unusual compared to most years – usually there is a drop over the holidays.
Because I see these same trends year after year, I realize it’s not at all attributable to how much Ronda is in the media in a given month. It’s a seasonal trend. Since I write about statistics and programming a lot, I’m pretty sure more people come to this blog during the academic year when they are taking a class. Also, people can read my blog at work and pretend it is work-related, even if I’m just ranting about something that day, because, hey there is a possibility that it COULD be about something relevant.
This assumption is further supported by the fact that the lowest days of the week for website visits are Saturday and Sunday.
It’s also interesting when you don’t find the same thing
If one defines “interesting” as not getting what you want, I had an interesting experience with a research project recently. Replicating the project a second year, we ran into all kinds of technical difficulties and the results were far from significant. In short, the subjects did not receive the planned intervention so no effect of intervention was observed. Much swearing ensued. I’m now analyzing data from the third year of the same project.
Multi-year studies make so much more sense to me and it troubles me that there are not more of them. I understand the reasons. For one thing, there is so much pressure to publish in many institutions that people put out as many articles as they can as quickly as they can (everyone except for YOU, of course). For another, multi-year studies are expensive and it is hard to justify funding to study something you supposedly already studied and reported on.
Yeah, I get it, but just like those people who confidently explain my website statistics, without replication it is too easy to be persuaded that one’s first, or completely contradictory second, hypothesis is correct.
I’m giving a talk on Preparing Students for the Real World of Data at SAS Global Forum next month.
You’d think 50 minutes would be long enough for me to talk, but that just goes to show you don’t know me as well as you think you do. One point made in the template for papers is that you should not try to tell every single thing you know about the DATA step, for example, because it will bore your audience to death.
Random Tips That Didn’t Make it Into the Paper
1. CATS removes blanks and concatenates
While I did give a few shout outs to character functions, it was not possible to put in every function that is worth mentioning. One that didn’t make the cut is the CATS function.
The CATS function concatenates strings, removing all leading and trailing blanks.
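A quick way to see what that means is to compare CATS with the plain concatenation operator. This is a toy sketch; the variable names and values are made up just to show the blanks.

```sas
data _null_;
  length with_cats with_bars $ 20;
  prefix = '  F ';    /* leading and trailing blanks */
  num    = ' 11  ';

  with_cats = cats(prefix, '_', num);   /* blanks stripped: F_11 */
  with_bars = prefix || '_' || num;     /* blanks kept between the pieces */

  /* Results are written to the log */
  put with_cats= / with_bars=;
run;
```

CATS gives you F_11, while the || operator leaves the blanks sitting between the pieces.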
Let’s say that I want to have each category renamed with a leading “F” to distinguish all of the variables from the Fish Lake game. I also want to add a ‘_’ to problems 10-14 so that when I chart the variables 11 comes just before 12, not before 2 (which is what would happen in alphabetical order). So, I include these statements in my DATA step.
IF problem_num IN(11,10,12,13,14) THEN probname = CATS('F','_',probname);
ELSE probname = CATS(game,probname) ;
Now when I chart the results you can see the drop off in correct answers as the game gets more difficult.
2. Not all export files are created equal
Nine of the ten datasets I needed I was able to download as an EXCEL file and open up in SAS Enterprise Guide. It was a piece of cake, as I mentioned last time. Unfortunately, the third file was downloaded from a different site and it had special characters in it, like division signs, and the data had commas in the middle of it. When I opened it up in SAS Studio it looked like this.
Fixing it was actually super simple. This was an Excel file. I simply did a Replace ALL and changed the division signs to “DIV” and the commas to spaces. The whole thing took FIVE lines to read in after that.
3. Listen to Michelle Homes and know your data
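filename fred "/courses/abc123add/sgf15/sl_pretest.csv" ;
Data pretest keyed;
LENGTH item9 $ 38 ;   /* item9 holds long text answers, so it needs more than the default 8 */
infile fred firstobs = 2 dlm="," ;   /* start at line 2 to skip the header row */
input started $ ended $ username $ (item1 - item24) ($) ;
run;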
Thank you to the lovely Michelle Homes for catching this! As she pointed out in the comments, the input statement assumes that the variables are 8 characters in length and character data. This is true for 26 of the 27 variables. However, ONE of the 24 items on the test is a question that can be answered with something like Four million, four thousand and twelve.
That, as you can see, is over 8 characters. So, I added a LENGTH statement. That brought up another issue, but that is the next post …
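If you want to show students what the default length of 8 does, here is a small sketch with one made-up record (the username is hypothetical, and I left the comma out of the answer since the comma is the delimiter).

```sas
/* Without LENGTH: list input with $ creates 8-byte character variables */
data truncated;
  infile datalines dlm=',';
  input username $ answer $;
  datalines;
student1,Four million four thousand and twelve
;
run;

/* With LENGTH declared before INPUT, the whole answer is kept */
data full;
  length answer $ 38;
  infile datalines dlm=',';
  input username $ answer $;
  datalines;
student1,Four million four thousand and twelve
;
run;

proc print data=truncated; run;
proc print data=full; run;
```

In the first data set the answer is cut off after "Four mil"; in the second it comes through whole.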
I’ll have a lot more to talk about in Dallas. Hope to see you there.
Want to be even smarter? Back us on Kickstarter! We make games that make you smarter. The latest one, Forgotten Trail, is going to be great! You can get cool prizes and great karma.
If you came into my office and watched me work today, just before I had you arrested for stalking me, you might notice me doing some things that are the absolute opposite of best practices.
I need about 10 datasets for some analyses I’ll be doing for my SAS Global Forum paper. I also want these data sets to be usable as examples of real data for courses I will teach in the future. While I’m at it, I could potentially use most of the same code for a research project.
The data are stored in an SQL database on our server. I could have accessed these in multiple ways but what I did was
1. Went into phpMyAdmin and chose EXPORT as ODS spreadsheet.
2. Opened the spreadsheet using Open Office, inserted a row at the top and manually typed the names of each variable.
Why the hell would I do that when there are a dozen more efficient ways to do it?
In the past, I have had problems with exporting files as CSV, even as Excel files. A lot of our data comes from children and adolescents who play our games in after-school programs. If they don’t feel like entering something, they skip it. That missing data has wreaked havoc in the past, with all of the columns ending up shifted over by 1 after record 374 and shifted over again after record 9,433. For whatever reason, Open Office does not have this problem and I’ve found that exporting the file as ODS, saving it as an xls file and then using the IMPORT DATA task or PROC IMPORT works flawlessly. The extra ODS > Excel step takes me about 30 seconds. I need to export an SQL database to SAS two or three times a year, so it is hard to justify trouble-shooting the issue to save myself 90 seconds.
IF YOU DIDN’T KNOW, NOW YOU KNOW
You can export your whole database as an ODS spreadsheet. It will open with each table as a separate sheet. When you save that as an XLS file, the structure is preserved with individual sheets.
You can import your data into SAS Enterprise Guide using the IMPORT DATA task and select which sheet you want to import. Doing this 2, 3 or however-many-sheets-you-have times will give you that number of data sets.
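If you would rather use PROC IMPORT than the task, the idea is the same: one step per sheet. A sketch, assuming a made-up file path and sheet names (pretest and answers are placeholders, not necessarily my actual table names), and assuming your installation has SAS/ACCESS Interface to PC Files, which DBMS=XLS needs.

```sas
/* Assumptions: file path and sheet names are placeholders */
proc import datafile="/home/your_userid/game_data.xls"
            out=work.pretest
            dbms=xls
            replace;
  sheet="pretest";    /* which sheet (database table) to read */
  getnames=yes;       /* first row holds the variable names */
run;

proc import datafile="/home/your_userid/game_data.xls"
            out=work.answers
            dbms=xls
            replace;
  sheet="answers";
  getnames=yes;
run;
```

GETNAMES=YES is where that typed-in first row of variable names pays off.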
WHY TYPE IN THE VARIABLE NAMES?
Let me remind you of Eagleson’s law
“Any code of your own that you haven’t looked at for six or more months might as well have been written by someone else.”
It has been a few months since I needed to look at the database structure. I don’t remember the name of every table, what each one does or all of the variables. Going through each sheet and typing in variable names to match the ones in the table is far quicker than reading through a codebook and comparing it to each column. I’ll also remember it better.
If I do this two or three times a year, though, wouldn’t using a DATA step be a time saver in the long run? If you think that, back up a few lines and re-read Eagleson’s law. I’ll wait.
Reading and understanding a data step I’d written would probably only take me 30 seconds. Remembering what is in each of those tables and variables would take me a lot longer.
I’ve already found one table that I had completely forgotten. When a student reads the hint, the problem number, username and whether the problem was correctly answered are written to a table named learn. I can compare the percentage correct from this dataset with the rest of the total answers file, of which it is a subset. Several other potential analyses spring to mind – on which questions are students most likely to use a hint? Do certain students ask for a hint every time while others never do?
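Those questions could be roughed out along these lines. This is only a sketch: I am assuming the learn table has been imported as a SAS data set and I am using hypothetical variable names (problem_num, username, correct coded 1 or 0), which may not match the real columns.

```sas
/* Assumption: work.learn exists, with hypothetical variables
   problem_num, username and correct (1 = right, 0 = wrong) */

/* On which problems do students use a hint most often? */
proc freq data=learn order=freq;
  tables problem_num;
run;

/* Proportion correct among answers given after reading a hint */
proc means data=learn n mean;
  var correct;
run;

/* Do some students ask for a hint far more than others? */
proc sql;
  select username, count(*) as hints_used
    from learn
    group by username
    order by hints_used desc;
quit;
```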
Looking at the pretest for Fish Lake, I had forgotten that many of the problems are two-part answers, because the answer is a fraction, so the numerator and denominator are recorded separately. This can be useful in analyzing the types of incorrect answers that students make.
The whole point of going through these two steps is that they cause me to pause, look at the data and reflect a little on what is in the database and why I wanted each of these variables when I created these tables a year or two ago. Altogether, it takes me less time than driving five miles in Los Angeles during rush hour.
This wouldn’t be a feasible method if I had 10,000,000 records in each table instead of 10,000 or 900 variables instead of 90, but I rather think if that was the case I’d be doing a whole heck of a lot of things differently.
My points, and I do have two, are
- Often when working with small and medium-sized data sets, which is what a lot of people do a lot of the time, we make things unnecessarily complicated
- No time spent getting to know your data is ever wasted