I get asked this question fairly often so I thought I would do a few posts on it. The most common problem is that a student who is new to statistics has no idea where to even start.
These examples use SAS but you could use any package you like.
My recommendation to students beginning to learn statistics is to start with some type of publicly available data set, getting some experience with real data.
1. IDENTIFY THE VARIABLES YOU HAVE AVAILABLE
The first thing to do is examine the contents of the dataset. Look at the variables you have available. With SAS, you would do this with PROC CONTENTS.
Your program at this point is super simple
LIBNAME mydata “path to where your data are” ;
PROC CONTENTS DATA = mydata.datasetname ;
Normally, you would come up with a hypothesis first and then collect the data. The advantage of working with public use data sets is you don’t have to go to the time and expense of interviewing 40,000 people. The disadvantage is that you are limited to the variables collected.
2. GENERATE A HYPOTHESIS
Looking at the California Health Interview Survey data, I came up with the following null hypothesis:
There is no difference in obesity among Caucasians, African-Americans and Latinos.
3. RUN DESCRIPTIVE STATISTICS
You need descriptive statistics for three reasons. First, if you don’t have enough variance on the variables of interest, you can’t test your null hypothesis. If everyone is white or no one is obese, you don’t have the right dataset for your study. Second, you are going to need to include a table of sample statistics in your paper. This should include standard demographic variables – age, sex, education, income and race are the main ones. Last, and not necessarily least, descriptive statistics will give you some insight into how your data are coded and distributed.
proc freq data = mydata.coh602 ;
tables race obese srsex aheduc ;
where race ne “” ;
proc means data= mydata.coh602 ;
var ak22_p srage_p ;
where race ne “” ;
Notice something about the code above – the WHERE statement. My hypothesis only mentioned three groups – Caucasians, African-Americans and Latinos. Those were the only three groups that had a value for the race variable. (This example uses a modified subset of the CHIS , if you are really into that sort of thing and want to know.) Since that is the population I will be analyzing, I do not want to include people who don’t fall into one of those three groups in my computation of the frequency distributions and means.
4. PUT TOGETHER YOUR FIRST TABLE
Using the results from your first analysis, you are all set to write up your sample section, like this
The sample consisted of 38,081 adults who were part of the 2009 California Health Interview Survey. Sample demographics are shown in Table 1.
<Then you have a Table 1>
Variable …………N…. %
- Black 2,181 5.7
- Hispanic ,4926 13.0
- White 30,974 81.3
- Male 15,751 41.4
- Female 22,330 58.6
Variable ……N ….. Mean… SD
Age…………38,081 55.4 18.0
Income 37,686 $69,888 $63,586
I’ll try to write more soon, but for now The Invisible Developer is pointing out that it is past 1 a.m. and I should get off my computer.
First off, the good news. You can find all of the papers from SAS Global Forum 2015 online. This is good news if you are anything like me (and you should be, because, let’s face it, I’m awesome) because even if you went to Dallas there were no doubt several papers you wanted to attend scheduled at the same time.
I liked everything I attended but there were two that stood out as really interesting. The first one was …
Taking the Path More Travelled – SAS Visual Analytics and Path Analysis
You can download it here
My idea of path analysis is a series of regression coefficients where you calculate direct and indirect effects. That is not the path analysis discussed in this paper.
He literally means what path did the customer (critter, whatever) take ?
For example, your path in using this website could be you went to the home page then blog page home then the previous entry on the blog.
While websites are an obvious use for this type of path analysis, there could be many others – customer experience in a call center, where people go in a huge department store, migration of humans or animals, path to achieving a job at a start-up.
Drop-off is often of interest in a path analysis – did they fall out of the path before the endpoint you wanted, e.g., sale, employment, customer support problem solved?
You can also look at weight in a path, not only whether they buy a widget but how much money did they spend?
Visual analytics allows for path segmentation. You can combine items or exclude items.
In the example, Schulz discussed using path analysis to see how effective your different online marketing methods are. Since many people will come from typing your name into a search engine, you may want to delete those paths and only include ads from Google adwords, blogher, your corporate website and other paid marketing efforts.
You can click on events and select Exclude to filter out all paths beginning with those events that are not of interest to you.
Sankey diagrams are available in visual analytics. Although these have their origin in uses like energy flow, they are now being applied all over the place.
Here is a sample from Schulz’s paper
A Sankey diagram, FYI, shows the direction and quantity of the flow along a path. There is a blog devoted to Sankey diagrams here.
(This wasn’t mentioned in the paper, I just found that interesting. I’m sure there’s a blog out there devoted to Gantt charts that I could find if I looked, which I didn’t.)
Once the path analysis roles of interest are defined:
… one of the first things to do would be drop the number of paths displayed. Just imagine the mess you would be looking at if you tried to visually display all of the paths someone took in navigating a website with even a few hundred pages.
You can edit the minimum path frequency, e.g., only show a path if at least 250 people took it.
This is just a brief, brief taste of what you can do with path analysis using SAS Visual Analytics and the coolness of SAS Global Forum. There was a lot, lot more and I’ll try to post about the second paper I really liked this week,
but for now perhaps I should quit looking out the window and pay attention in this training session I’m sitting in at Fort Berthold (don’t tell Bruce I wrote my blog during it).
If you missed out on SAS Global Forum, you don’t need to wait until next year for your fix of networking, instruction (and possibly drinking). You can go to the Western Users of SAS Software conference in San Diego in September.
Great! You are using SAS Studio. It’s free. Even greater. You cleaned your data, created subscales. You have this perfect dataset and now, you want to save that dataset to your desktop and maybe do some more work with it, or just open up and admire it – who am I to judge?
Follow these three easy steps:
1. Right-click on the data set.
2. Select Export and export it to one of your folders as, say, a .csv file.
3. Go to that file, select it, and click download.
Are you kidding me?
If you are a programmer, analyst, statistician, professor or student who uses SAS this is an opportunity to get to know your people and to get known.
I’m in Dallas for the SAS Global Forum, which I try to attend whenever I can. Yes, I could watch videos on the Internet, read books, read web pages, but I often don’t because I have a to-do list a mile long.
By presenting at the conference, I have to review what I am doing in teaching with SAS Studio and why.
SHAMELESS PLUG: My session on Preparing Students for the Real World with SAS Studio is a good one for both anyone who teaches with SAS and for anyone who is new to the SAS world and wants a good introductory session.
Since I am at the conference, I have a little bit of downtime to look into SAS resources. My new favorite is SAS communities. It’s a combination forum and free library. I must have looked into it at some point, because I had an account, but it seems to be more active now. I even submitted an article and poked around in the forum.
Then, of course, there are all of the sessions that I will attend, conversations I will have with people, books I will hear about and buy, to read on the plane ride home.
It’s a week of learning.
But , but, you stutter like a motor boat, it’s expensive and far away. I can’t afford it. Besides, I would feel uncomfortable presenting at the same conference with all of those people who wrote the books on SAS (literally).
The expensive part I get. The not feeling like you could present at the same conference part is just silly, so I’m going to pretend you didn’t say that.
If travel and cost is an issue, present at your local conference. The call for papers for the Western Users of SAS Software (WUSS) is open. Do it now!
It is painless. You submit a 300-word abstract. You can submit a working draft of the paper at the same time. That’s not mandatory but it improves your chances.
There is even a mentoring program where old people (like me), will help you revise your program and get ready to present.
Writing and presenting the paper will force you to think about what you are doing and why. You will likely make some contacts of people who will be potential employers, collaborators or drinking buddies.
What are you waiting for? A personal invitation?
Fine! Here you go.
Need a topic? Here are 10 I would like to see
- The 25 functions I use most.
- Uses of PROC FORMAT .
- Multinomial logistic regression.
- The many facets of PROC FREQ.
- Factor analysis
- SAS for basic biostatistics
- Macro for data cleaning
- Model selection procedures
- Mixed models vs PROC GLM
- SAS Graphs without SAS/Graph (because SAS/Graph appears to be written in Klingon)
My point is that if I sat here and thought of 10 off the top of my head after two glasses of Chardonnay and half a glass of the champagne someone who will remain nameless bought at Costco and brought here from a state in the WUSS region, then I’ll bet you could come up with something really awesome stone-cold sober and given more than 60 seconds.
Let’s recap what we have learned here, shall we?
- Join SAS communities,
- Attend conferences, whether national or global,
- Don’t be a wallflower – present!
- Texas steak and wine is a good combination (not particularly related to SAS but true nonetheless)
It has been pretty well established that I am the worst soccer mom in the history of soccer moms. Most of the games I miss because I am somewhere else. My children have told me that my autobiography should be entitled, “I was out of town at the time” because most of the stories of their childhood begin this way.
Having come back in town shortly before the game this weekend, I was unaware that it was a two-day tournament 2 1/2 hours from home and that we were supposed to have reserved the hotel weeks ago. Hot tip: If you get your reservation last minute and have the choice of a close hotel or a nice hotel, get the nice one.
I fulfilled my obligation. I showed up. During the time The Spoiled One played, I watched. During half time and the breaks between the games I was able to write a couple of blog posts and test out SAS Studio.
If you look at the picture above you might see that I was working in a field surrounded by mountains. Not the best situation for Internet access, which I had via the hotspot on my iPhone.
I was able to log on to SAS Studio with no problem. When I logged in on my iPad I had the screen shown above where I could just start typing my program in the code window.
To see folders, libraries, etc. tap the BROWSE link in the top left corner, as shown
You can tap any of the categories to bring down the list of folders, libraries, etc. You can tap on a file to open it.
The one problem I did have, and depending on your situation, it may be a severe one, was that I could not get any of the libraries to open. I wanted to open the sashelp library and see if I could run some tasks using an open data set. This did not work. It is very possibly related to poor Internet due to laying in a soccer field ringed by mountains. I tried it last year in a movie theater and I was able to access the libraries. In this case, as you might guess from the top photo, the Internet was barely accessible.
Next, I tried simulating a homework problem a student might have, just typing in some data and running the program.
I have a bluetooth keyboard I use with my iPad and it all worked fine. I typed in data, tapped on the little running guy and my program ran fine. You can see the results below.
To save it, I held down the home button and the power button simultaneously, just like any time you take a screenshot on an iPad. Then, I emailed that screenshot to myself, so here you have results.
My point is that a student could do their homework using SAS Studio in the middle of a soccer field on an iPad, as long as it did not require external files, which most of the homework I assign does not. They could then email the results to their professor, still from the (dis)comfort of the field.
This is useful to know for three reasons:
- I travel frequently to areas where there is very limited bandwidth,
- Many of the students in my online courses live in areas with limited bandwidth,
- The Spoiled One’s team won their bracket in the State Cup, so it turns out that means they have more soccer games next weekend as they advanced in the tournament. This is not at the same field surrounded by mountains. It’s at a different field at the edge of the desert. Sigh.
Your students should be able to use SAS Studio almost anywhere, even if all they have is an iPad.
This is doubly true if you don’t assign homework that requires accessing external datasets.
I’ll be able to review homework assignments for the course I am teaching next during the soccer tournament this weekend. (I really AM the worst soccer mom in the history of ever.)
—————— SHAMELESS PLUG
Our Kickstarter campaign is still going on, making adventure games to make math (and history and English) awesome.
We are 84% of the way to our goal!
I’m giving a talk on Preparing Students for the Real World of Data at SAS Global Forum next month.
You’d think 50 minutes would be long enough for me to talk, but that just goes to show you don’t know me as well as you think you do. One point made in the template for papers is that you should not try to tell every single thing you know about the DATA step, for example, because it will bore your audience to death.
Random Tips That Didn’t Make it Into the Paper
1. CATS removes blanks and concatenates
While I did give a few shout outs to character functions, it was not possible to put in every function that is worth mentioning. One that didn’t make the cut is the CATS function.
The CATS function concatenates strings, removing all leading and trailing blanks.
Let’s say that I want to have each category renamed with a leading “F” to distinguish all of the variables from the Fish Lake game. I also want to add a ‘_’ to problems 10-14 so that when I chart the variables 11 comes just before 12, not before 2 (which is what would happen in alphabetical order). So, I include these statements in my DATA step.
IF problem_num IN(11,10,12,13,14) THEN probname = CATS(‘F’,’_’,probname);
ELSE probname = CATS(game,probname) ;
Now when I chart the results you can see the drop off in correct answers as the game gets more difficult.
2. Not all export files are created equal
Nine of the ten datasets I needed I was able to download as an EXCEL file and open up in SAS Enterprise Guide. It was a piece of cake, as I mentioned last time. Unfortunately, the third file was download from a different site and it had special characters in it, like division signs, and the data had commas in the middle of it. When I opened it up in SAS Studio it looked like this.
Fixing it was actually super simple. This was an Excel file. I simply did a Replace ALL and changed the division signs to “DIV” and the commas to spaces. The whole thing took FIVE lines to read in after that.
3. Listen to Michelle Homes and know your data
filename fred “/courses/abc123add/sgf15/sl_pretest.csv ” ;
Data pretest keyed;
LENGTH item9 $ 38. ;
infile fred firstobs = 2 dlm=”,”;
input started $ ended $ username $ (item1 – item24) ($) ;
Thank you to the lovely Michelle Homes for catching this! As she pointed out in the comments, the input statement assumes that the variables are 8 characters in length and character data. This is true for 26 of the 27 variables. However, ONE of the 24 items on the test is a question that can be answered with something like Four million, four thousand and twelve.
That, as you can see, is over 8 characters. So, I added a LENGTH statement. That brought up another issue, but that is the next post …
I’ll have a lot more to talk about in Dallas. Hope to see you there.
Want to be even smarter? Back us on Kickstarter! We make games that make you smarter. The latest one, Forgotten Trail, is going to be great! You can get cool prizes and great karma.
If you came into my office and watched me work today, just before I had you arrested for stalking me, you might notice me doing some things that are the absolute opposite of best practices.
I need about 10 datasets for some analyses I’ll be doing for my SAS Global Forum paper. I also want these data sets to be usable as examples of real data for courses I will teach in the future. While I’m at it, I could potentially use most of the same code for a research project.
The data are stored in an SQL database on our server. I could have accessed these in multiple ways but what I did was
1. Go into phpMyAdmin and chose EXPORT as ODS spreadsheet.
2. Opened the spreadsheet using Open Office, inserted a row at the top and manually typed the names of each variable.
Why the hell would I do that when there are a dozen more efficient ways to do it?
In the past, I have had problems with exporting files as CSV, even as Excel files. A lot of our data comes from children and adolescents who play our games in after-school programs. If they don’t feel like entering something, they skip it. That missing data has wreaked havoc in the past, with all of the columns ended up shifted over by 1 after record 374 and shifted over again after record 9,433. For whatever reason, Open Office does not have this problem and I’ve found that exporting the file as ODS, saving it as an xls file and then using the IMPORT DATA task or PROC IMPORT works flawlessly. The extra ODS > Excel step takes me about 30 seconds. I need to export an SQL database to SAS two or three times a year, so it is hard to justify trouble-shooting the issue to save myself 90 seconds.
IF YOU DIDN’T KNOW, NOW YOU KNOW
You can export your whole database as an ODS spreadsheet. It will open with each table as a separate sheet. When you save that as an XLS file, the structure is preserved with individual sheets.
You can import your data into SAS Enterprise Guide using the IMPORT DATA task and select which sheet you want to import. Doing this 2, 3 or however-many-sheets-you-have times will give you that number of data sets.
WHY TYPE IN THE VARIABLE NAMES?
Let me remind you of Eagleson’s law
“Any code of your own that you haven’t looked at for six or more months might as well have been written by someone else.”
It has been a few months since I needed to look at the database structure. I don’t remember the name of every table, what each one does or all of the variables. Going through each sheet and typing in variable names to match the ones in the table is far quicker than reading through a codebook and comparing it to each column. I’ll also remember it better.
If I do this two or three times a year, though, wouldn’t using a DATA step be a time saver in the long run? If you think that, back up a few lines and re-read Eagleson’s law. I’ll wait.
Reading and understanding a data step I’d written would probably only take me 30 seconds. Remembering what is in each of those tables and variables would take me a lot longer.
I’ve already found one table that I had completely forgotten. When a student reads the hint, the problem number, username and whether the problem was correctly answered is written to a table named learn. I can compare the percentage correct from this dataset with the rest of the total answers file, of which is a subset. Several other potential analyses spring to mind – on which questions are students most likely to use a hint? Do certain students ask for a hint every time while others never do?
Looking at the pretest for Fish Lake, I had forgotten that many of the problems are two-part answers, because the answer is a fraction, so the numerator and denominator are recorded separately. This can be useful in analyzing the types of incorrect answers that students make.
The whole point of going through these two steps is that they cause me to pause, look at the data and reflect a little on what is in the database and why I wanted each of these variables when I created these tables a year or two ago. Altogether, it takes me less time than driving five miles in Los Angeles during rush hour.
This wouldn’t be a feasible method if I had 10,000,000 records in each table instead of 10,000 or 900 variables instead of 90, but I rather think if that was the case I’d be doing a whole heck of a lot of things differently.
My points, and I do have two, are
- Often when working with small and medium-sized data sets, which is what a lot of people do a lot of the time, we make things unnecessarily complicated
- No time spent getting to know your data is ever wasted
Some people may have said that hackathons are a stupid ass idea where a bunch of people who have can’t afford to buy their own pizza spend 48 hours with a bunch of strangers and no showers.
Okay, well, maybe that was me.
I take it all back.
We kicked off our hackathon at noon on Monday and wrapped up at 8 pm on Tuesday. The rules were simple – everyone who was working those days was to wipe their schedule completely for 8 hours each day and do nothing but work on the game. No emails, no blog posts, no meetings except for a kick off meeting each day to assign and review tasks. Jessica, Dennis, Samantha and I worked on the game for (at least) 16 hours. Any emails or interviews got done before the hackathon hours or after they were over. (I did pause for a brief interview with the Bismarck State College paper.)
Maria came in from maternity leave and worked 8 hours on Monday, baby in tow.
Gonzalo and Eric each worked their regular shifts on Monday and Tuesday, respectively, doing nothing but writing code, creating sprites and editing audio. Sam even pitched in a few hours early in the morning from Canada. Our massively talented artist, Justin, completed all of the new artwork before the meeting so we had it in hand to drop into all of the spots where there had been placeholders.
So, in two days a total of 100 hours were devoted just to game development. We made a giant leap forward.
Why did it work so well? For one thing, we were all in the same spot for a long time. Although the original plan was to meet and then people would go there separate ways, on Monday, five of the six people working stayed at my house. Three of us even slept there. That had two positive impacts.
First of all, whenever anyone needed something, whether it was a piece of artwork modified or a question answered on whether we had a sound file of footsteps in the woods or to be shown how to do a voice over in iMovie, there was someone else to provide that assistance right on the spot. Very often, you can spend hours searching for something on Google, watching youtube videos, reading manuals trying to figure out how to do X when someone else can come up and say – Click on Window, pick record voiceover, click on the microphone in the middle of the left side of the window.
There are also those questions that CANNOT be found on Google, like where the hell was the new background image saved and what is it called.
The second positive impact was we got around to tasks that needed doing for a long time. While it may have seemed it kept us from getting real progress done on the game, the fifth time Sourcetree complained about not tracking those damned Dreamweaver .idea files, I HAD it and we removed those from the repository forever. When something bugs you every now and then you may think, “I’ll do it later”, but the fifth time it happens that day …
Anyway, I would share more of the awesomeness of the hackathon experience with you but it is now 9 pm and we are taking the team out for sushi.
In case you don’t know, SAS On-Demand is the FREE , as in free beer, offering of SAS for academic use. How good is it? There really can’t be one answer to that.
First of all, there are multiple options – SAS Studio, SAS Enterprise Miner, SAS Enterprise Guide, JMP, etc. so some may be better than others.I have a fair bit of experience with two of them, so let’s just look at one of those today.
I mostly use SAS Studio with my students and over the past few courses I have been really pleased with the results. I selected SAS Studio over Enterprise Guide because I strongly believe it is useful for students to learn to code and many students, yes, even in an area like biostatistics need a little encouragement to learn. While they don’t end up expert SAS programmers after two or three courses, they at least can code a DATA step , read in raw data, aggregate data and data from external files, produce a variety of statistics and graphics and interpret the results.
Let’s be frank about this … it’s going to require a bit of work up front. You need to create a course with SAS On-Demand. You need to notify your students that they need to create accounts. If you are not going to use solely the sashelp directory data sets, you’re going to have to upload your own data.
Please don’t tell me you plan on solely using the sashelp data sets! These are really helpful for the first assignment or two while students get their feet wet but unless you expect your students to have careers where all of their files to be analyzed are going to be shipped with the software they use, you’re going to move to reading in other types of data sooner or later.
Your data are going to be stored on the SAS server (so you can tell people who ask that yes, you are ‘computing in the cloud’ – instead of what I usually tell people who ask stupid questions like that, which is shut the hell up and quit bothering me – but I digress. Even more than usual.)
No matter what software you use, you’re going to have to select some data sets for students to analyze, have some sort of codebook and make sure your data is reasonably clean (but not so clean that students won’t learn something about data quality problems). So, the only real additional time is figuring out how to get it on the SAS server.
None of these steps take much time, but adding them all up – getting a SAS profile, creating a course, creating an email to send to all of your students, with the correct LIBNAME, uploading your data – it all maybe adds up to a couple of extra hours.
My challenge always is how I shoehorn additional content into the very limited class time I have with students. One tool I’ve been using lately is livebinders. This is an application that lets you put together an online binder of web pages, videos and material you write yourself.
Here is an example of a livebinder I use for my graduate course in epidemiology. It has SAS assignments beginning with simply copying code to modifying it . Links to the relevant SAS documentation are included, as are videos that show step by step how to use SAS Studio for computing relative risk, population attributable risk, etc. I have a similar livebinder for my biostatistics course.
You might think this is a bit of hand-holding to walk the students through it, but I would disagree. Every time I have found myself thinking,
“Well, this is a little too easy”,
I have been wrong.
If you have been doing something for a decade or, in my case, a few decades, it’s hard to remember how confusing concepts were the very first time. Even things that you do automatically, like downloading your results as an HTML file, were a mystery at one time in your life. Making the videos takes some time initially – you have to do a screencast, and then the voice over. Sometimes I do them at once, using QuickTime and GarageBand simultaneously. Other times, I import the screencast into iMovie and record a voiceover.
Either way, a 7-minute video usually takes me half an hour to record, when you add in screwing up the first time, editing out the part where The Spoiled One came in and asked for money to go shopping, etc. So, you’re adding maybe 3-4 hours to the time you spend on your course. On the other hand, you only have to do it once, so, if you teach the same course a few times, it pays off. I cannot tell you how many times students tell me that the videos were helpful. Unlike when I am lecturing in class, they can slow the video down, play it over.Students end the course with experience coding, using data from actual studies and interpreting data to answer problems that matter.
My point is, that it is a little more work to teach using SAS Studio, but it is worth it.
Kappa is a useful measure of agreement between two raters. Say you have two radiologists looking at X-rays, rating them as normal or abnormal and you want to get a quantitative measure of how well they agree. Kappa is your go-to coefficient.
How do you compute it? Well, personally, I use SAS because this is the year 2015 and we have computers.
Let’s take this table, where 100 X rays were rated by two different raters as an example:
Rating by Physician 1
————-Abnormal | Normal
Abnormal 40 20
Normal 10 30
So ….. the first physician rated 60 X-rays as Abnormal. Of those 60, the second physician rated 40 abnormal and 20 normal, and so on.
If you received the data as a SAS data set like this, with an abnormal rating = 1 and normal = 0, then life is easy and you can just do the PROC FREQ.
and so for 50 lines.
However, I very often get not an actual data set but a table like the one above. In this case, it is still relatively simple to code
DATA compk ;
INPUT rater1 rater2 nums ;
1 1 40
1 0 20
0 1 10
0 0 30
So, there were 40 x-rays coded as abnormal by both rater1 and rater2. When rater1 = 1 (abnormal) and rater2 = 0 (normal), there were 20, and so on.
The next part is easy
PROC FREQ DATA = compk ;
TABLES rater1*rater2/ AGREE ;
WEIGHT nums ;
That’s it. The WEIGHT statement is necessary in this case because I did not have 100 individual records, I just had a table, so the WEIGHT variable gives the number in each category.
This will work fine for a 2 x 2 table. If you have a table that is more than 2 x 2, at the end, you can add the statement
TEST WTKAP ;
This will give you the weighted Kappa coefficient. If you include this with a 2 x2 table nothing happens because the weighted kappa coefficient and the simple Kappa coefficient are the same in this case.
See, I told you it was simple.