Ask me anything: Part 2

Continuing on with questions students asked at the end of the semester …

Note that the following questions were asking what I personally do, and I answered the same way. These are not rules that anyone has to follow, like taking the square root of the variance to find the standard deviation, but they are, I think, generally good advice.

Are there tests/tasks you always do when analyzing data?

Yes. If I am using SAS Enterprise Guide, I always run the Characterize Data task. If I am using another software package, I always run descriptive statistics (DESCRIPTIVES in SPSS; PROC CONTENTS, PROC MEANS and PROC FREQ in SAS; DESCRIBE, SUMMARIZE and TABULATE in Stata).

The reason I do this is that I want to get a good look at my data and make sure there are no real problems. If the data are no good, there isn’t much point in going any further without fixing them.

Is there a specific order to the tests you perform when analyzing data?

Yes. I do descriptive statistics first to check data quality and do a reality check (see that all of my subjects aren’t the same age, that “rutabaga” is not a value entered for race, and so on). I also do descriptive statistics to get an idea of what type of sample these data represent. Are they middle school students, older adults, a random sample of the population of California?
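As a rough illustration of that first pass, here is a minimal Python sketch using only the standard library. The records, the field names and the list of valid race codes are all made up for the example; they stand in for whatever PROC MEANS and PROC FREQ (or their equivalents) would show you on real data.

```python
import statistics
from collections import Counter

# Hypothetical records; in practice these would be read from your data file.
records = [
    {"age": 12, "race": "Latino"},
    {"age": 13, "race": "White"},
    {"age": 12, "race": "Black"},
    {"age": 14, "race": "rutabaga"},  # a data-entry problem
    {"age": None, "race": "Asian"},   # a missing value
]

# Hypothetical codebook of allowed categories for the race field.
VALID_RACE = {"Asian", "Black", "Latino", "White", "Other"}


def describe_numeric(rows, field):
    """Return n, number missing, mean, SD, min and max for a numeric field."""
    values = [r[field] for r in rows if r[field] is not None]
    return {
        "n": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def frequencies(rows, field):
    """Count how often each value of a categorical field occurs."""
    return Counter(r[field] for r in rows)


age_summary = describe_numeric(records, "age")
race_counts = frequencies(records, "race")
# Reality check: any category that is not in the codebook needs a closer look.
unexpected = sorted(v for v in race_counts if v not in VALID_RACE)

print(age_summary)
print(race_counts)
print("Unexpected race values:", unexpected)
```

If the standard deviation of age came out as zero, or the list of unexpected values were long, that would be the cue to stop and fix the data before running anything else.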

Next, I do bivariate statistics, both descriptive and inferential. I want to take a look at the simple relationships first. That way, if I notice later on that there is a relationship between living near the coast and SAT scores, I am already aware that mean housing prices are significantly higher by the beach. Before I go off on a tangent about the salt air developing brain cells, I consider the possibility that it might just be the same old correlation of SES and academic achievement reported thousands of times over.
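To make the coast and SAT example concrete, here is a small Python sketch that computes every pairwise Pearson correlation among three variables. The six districts and all of their values are invented purely for illustration, and the variable names are assumptions of mine, not anything from a real data set.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Made-up values for six hypothetical school districts.
data = {
    "near_coast": [1, 1, 1, 0, 0, 0],                      # 1 = coastal
    "median_home_price": [900, 850, 700, 400, 450, 300],   # thousands of dollars
    "mean_sat": [1200, 1180, 1100, 1000, 1050, 950],
}

# Look at every simple two-variable relationship before building bigger models.
pairwise = {
    (a, b): pearson_r(data[a], data[b]) for a, b in combinations(data, 2)
}

for (a, b), r in pairwise.items():
    print(f"{a} vs {b}: r = {r:.2f}")
```

In this toy data all three correlations are positive, which is exactly the situation where the coast effect and the housing price effect cannot be told apart without looking further.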

[Image: mom and toddler. Caption: Trust is a great basis for a relationship. For research, I want data.]

Next, if I am going to use any measures, say a scale of attitudes toward innovation, I compute, at a bare minimum, internal consistency reliability. I say a bare minimum because you should always be able to get that. All you need are the answers to the individual items. If I don’t have those individual items, I am going to be very uncomfortable using the scale because I am just going on trust that the items were coded correctly and the scale was scored correctly. Trust is a good basis for a personal relationship, but I don’t like to base my results on it.
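One common measure of internal consistency is Cronbach's alpha, and the point that the individual item responses are all you need is easy to see in code. The sketch below is a minimal Python version using the usual formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). The five respondents and four items are hypothetical.

```python
import statistics


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    # Transpose so that each tuple holds every respondent's answer to one item.
    items = list(zip(*responses))
    item_variances = [statistics.variance(item) for item in items]
    total_scores = [sum(person) for person in responses]
    total_variance = statistics.variance(total_scores)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical answers from five people to a four-item attitude scale (1 to 5).
scale_responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
]

alpha = cronbach_alpha(scale_responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

A statistical package will give you the same number plus item-level diagnostics, so treat this as a way to see what is being computed rather than a replacement for the package.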

Finally, if appropriate, I would do any multiple regression, logistic regression, survival analysis or other technique using multiple predictor variables.

How do you select what statistics you want to report?

First off, ten points extra credit for knowing that I always run far more statistics than I report. There are two reasons I do this.

First, I run statistics to assess data quality and the representativeness of the sample, and these are not of much interest to the reader. Even if the documentation for a data set swears up and down, with pictures of angels, that the sample was randomly selected from the population, I am going to at least take a look at whatever sample demographics I have available to determine whether it is evenly split by gender, has a distribution by race that approximates the population, and looks reasonable on any other variables I can imagine. Sometimes a variable like “whether or not there is a mother in the home” may show up in unexpected ways. More often, everything is exactly as expected, and I don’t necessarily report that “the American Community Survey is a representative sample of the state of California, just like the Census Bureau claimed”. So, I generally don’t report statistics measuring the quality of the data and sample. If they are positive, there’s nothing of interest to say, and if they are negative, I need to fix the problem or not proceed with the study.

Second, I run a lot of models and look for convergence. I may run a PROC GLM and a PROC MIXED, with school included as a random effect and without it. I may run a logistic regression where I split campuses into “reported more than ten crimes” and “reported fewer than ten crimes” and use that as the dependent variable rather than the number of crimes. I’m doing these multiple analyses primarily for me. I want to understand what is going on here, whether major violations of the normality assumption are affecting the results, and what would happen if I split the data a different way, such as zero crimes versus one or more crimes. What I am NOT doing is running 100 analyses and reporting the five that are significant. On the contrary, I am looking at the same question four or five related ways and expecting all or nearly all to be significant. If not every one comes out significant, I want to know why. The decision to accept or reject the null hypothesis should not depend on using exactly THAT measure and coding the analysis using precisely THESE options. That, my friend, is not a very robust finding. It reminds me of a comment Dr. Donald MacMillan made when someone asked him what test was best to determine if a child was mentally retarded. He said,

“If you have to give a test to know whether he’s mentally retarded or not, odds are, he isn’t.”
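The idea of checking whether a result survives a different coding of the outcome can be sketched in a few lines of Python. This is only a toy: a simple difference in proportions stands in for the logistic regressions, the urban versus rural grouping is a predictor I made up for the example, and the crime counts are invented.

```python
def proportion_above(counts, cutoff):
    """Share of campuses whose reported crime count exceeds the cutoff."""
    return sum(c > cutoff for c in counts) / len(counts)


# Invented crime counts for two hypothetical groups of campuses.
urban = [0, 3, 12, 15, 22, 8, 11, 30]
rural = [0, 0, 1, 4, 2, 11, 0, 6]

# Two ways of dichotomizing the same outcome.
codings = {
    "more than ten crimes": 10,
    "one or more crimes": 0,
}

results = {}
for label, cutoff in codings.items():
    diff = proportion_above(urban, cutoff) - proportion_above(rural, cutoff)
    results[label] = diff
    print(f"{label}: urban minus rural = {diff:+.3f}")

# A finding that flips direction when the cutoff moves deserves a closer look.
same_direction = (
    all(d > 0 for d in results.values()) or all(d < 0 for d in results.values())
)
print("Same direction under every coding:", same_direction)
```

With real data each coding would get its own proper model and significance test; the point of the sketch is only the habit of asking the same question more than one way and comparing the answers.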

What then do I report?

Sample demographics
Descriptive statistics for all independent and dependent variables used in the analyses
Test reliability and validity data if the measures are not standard ones. (I would not report reliability for the SAT, but for Joe’s Test of Academic Achievement, I would.)
Inferential statistics (assuming I ran any, which I usually did)

Photo by Andrew Lih.

So, what about inferential statistics, which ones do I report? I think I am in the minority here because I generally choose more familiar, easier to understand statistics. If I had done a regression and a mixed model and they gave me essentially the same results, I’d report the regression. If I did a chi-square with one independent variable and correctly classified 80% of the tofu-eaters and my logistic regression with three variables, all of them significant, correctly predicted 82% of those with a predilection for tofu, I’d go with the chi-square. If my structural equation model didn’t really add a lot compared to doing three multiple regressions, I’d report the regressions. My main objective is to inform my clients, not prove how smart I am.

For me, a technique that is less accessible to the general public is going to have to have some substantial improvement in prediction or clarity (for example, documenting interaction effects) to justify its use over a more basic technique.

So, those are my answers for the day. Tune in next time for a completely unrelated post on why Black in America was awesome, accelerators sound exciting and I don’t think I could ever do one.
