I get asked this question fairly often, so I thought I would do a few posts on it. The most common problem is that a student who is new to statistics has no idea where to even start.
These examples use SAS but you could use any package you like.
My recommendation to students beginning to learn statistics is to start with a publicly available data set, so that you get some experience with real data.
1. IDENTIFY THE VARIABLES YOU HAVE AVAILABLE
The first thing to do is examine the contents of the dataset. Look at the variables you have available. With SAS, you would do this with PROC CONTENTS.
Your program at this point is super simple:
* Point the libref mydata at the folder that holds your data ;
LIBNAME mydata "path to where your data are" ;
PROC CONTENTS DATA = mydata.datasetname ;
RUN ;
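If you want a little more than the default listing, here is a minimal sketch of two common additions. It assumes the same mydata libref and the placeholder datasetname from the program above; the choice of 10 observations is arbitrary.

```sas
* Sketch only: datasetname is the placeholder used in the program above ;
* VARNUM lists variables in their order in the data set, not alphabetically ;
PROC CONTENTS DATA = mydata.datasetname VARNUM ;
RUN ;

* Print the first 10 observations to see how values are actually stored ;
PROC PRINT DATA = mydata.datasetname (OBS = 10) ;
RUN ;
```

Looking at a few raw rows alongside the variable list often makes it easier to tell which variables are coded as text and which are numeric.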
Normally, you would come up with a hypothesis first and then collect the data. The advantage of working with public use data sets is you don’t have to go to the time and expense of interviewing 40,000 people. The disadvantage is that you are limited to the variables collected.
2. GENERATE A HYPOTHESIS
Looking at the California Health Interview Survey data, I came up with the following null hypothesis:
There is no difference in obesity among Caucasians, African-Americans and Latinos.
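One way to make that hypothesis concrete is to picture it as a two-way table of race by obesity: under the null hypothesis, the percentage of obese respondents would be about the same in every row. The sketch below assumes the data set (mydata.coh602) and the variable names race and obese that are used in the descriptive statistics step of this post. It only displays the table and does not run a test.

```sas
* Sketch: cross-tabulate race by obese and keep only the row percentages ;
* NOCOL and NOPERCENT suppress the column and overall percentages ;
proc freq data = mydata.coh602 ;
    tables race * obese / nocol nopercent ;
    where race ne "" ;
run ;
```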
3. RUN DESCRIPTIVE STATISTICS
You need descriptive statistics for three reasons. First, if you don’t have enough variance on the variables of interest, you can’t test your null hypothesis. If everyone is white or no one is obese, you don’t have the right dataset for your study. Second, you are going to need to include a table of sample statistics in your paper. This should include standard demographic variables – age, sex, education, income and race are the main ones. Last, and not necessarily least, descriptive statistics will give you some insight into how your data are coded and distributed.
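* Frequency distributions for the categorical variables ;
proc freq data = mydata.coh602 ;
tables race obese srsex aheduc ;
where race ne "" ;
run ;
* N, mean and standard deviation for the numeric variables ;
proc means data = mydata.coh602 ;
var ak22_p srage_p ;
where race ne "" ;
run ;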
Notice something about the code above: the WHERE statement. My hypothesis only mentioned three groups: Caucasians, African-Americans and Latinos. Those were the only three groups that had a value for the race variable. (This example uses a modified subset of the CHIS, if you are really into that sort of thing and want to know.) Since that is the population I will be analyzing, I do not want to include people who don’t fall into one of those three groups in my computation of the frequency distributions and means.
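If repeating the WHERE statement in every procedure gets tedious, one alternative is to subset the data once and run everything on the subset. This is a sketch, not the approach used in this post: work.analysis is a hypothetical data set name I made up for illustration, and the MISSING function is used in place of comparing race to an empty string.

```sas
* Sketch: keep only respondents who have a value for race ;
* work.analysis is a hypothetical name, not part of the original program ;
data work.analysis ;
    set mydata.coh602 ;
    where not missing(race) ;
run ;

* Later procedures can now omit the WHERE statement ;
proc freq data = work.analysis ;
    tables race obese srsex aheduc ;
run ;
```

The trade-off is an extra copy of the data in exchange for not having to remember the filter each time.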
4. PUT TOGETHER YOUR FIRST TABLE
Using the results from your first analysis, you are all set to write up your sample section, like this:
The sample consisted of 38,081 adults who were part of the 2009 California Health Interview Survey. Sample demographics are shown in Table 1.
<Then you have a Table 1>
Variable        N         %
Race
- Black         2,181     5.7
- Hispanic      4,926     13.0
- White         30,974    81.3
Sex
- Male          15,751    41.4
- Female        22,330    58.6

Variable        N         Mean       SD
Age             38,081    55.4       18.0
Income          37,686    $69,888    $63,586
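To get output that lines up with the second half of the table, you can ask PROC MEANS for just the statistics you plan to report. This is a minimal sketch that assumes the same data set and variables as the descriptive statistics step; one decimal place is my choice for display, not a requirement.

```sas
* Sketch: request only N, mean and standard deviation, to one decimal place ;
proc means data = mydata.coh602 n mean std maxdec=1 ;
    var srage_p ak22_p ;
    where race ne "" ;
run ;
```

The N and percent columns in the first half of the table come straight from the PROC FREQ output for race and srsex.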
I’ll try to write more soon, but for now The Invisible Developer is pointing out that it is past 1 a.m. and I should get off my computer.