I’ll be speaking about being a statistical consultant at SAS Global Forum in D.C. in March/April. While I will talk a little about factor analysis, repeated measures ANOVA and logistic regression, that comes at the end of my talk. The first things a statistical consultant should know don’t have much to do with statistics.

A consultant has paying clients.

In History of Psychology (it was a required course, don’t judge me) one of my fellow students chose to give her presentation as a one-woman play, with herself as Sigmund Freud. “Dr. Freud” began his meeting with a patient by discussing his fee. In fact, Freud did not accept charity patients. He charged everyone. There’s a winning Trivial Pursuit fact for you.*

Why am I starting by telling you this? Because I have had plenty of graduate students whose goal is “to be a consultant,” but who seem to think their biggest problem when they start out is going to be whether they should do propensity score matching using the nearest neighbor or caliper method.

Here are the biggest problems you’ll face:

  • Getting your first clients
  • Getting paid
  • Getting your data into shape
  • Communicating results to your clients

Let’s start with getting clients. I can think of four ways to do this: referrals, working as part of a consulting company, through your online presence and through an organization. I’ve done three of them. First, and most effective, I think, is through referrals. I got my first two clients when professors who did consulting on the side recommended me. I do this myself. If someone can’t afford my fees or I am just booked at the moment, I will refer potential clients to students, former students or other professionals I know who are getting started as consultants. It’s not competing with my business. I am never going to work for $30 an hour again, and if that’s all that’s in your budget, I understand. If all you need is someone to do a bunch of frequency distributions and a chi-square for you, you don’t need me, although I’m happy to do it as part of a larger contract.

Lesson number one: Don’t be a jerk.

Referrals mean I’m using my own reputation to help you get a job and so I’m going to refer students who are good statisticians and who I think will be respectful and honest with the client. Don’t underestimate the latter half of that statement.

Lesson number two: It helps if you really love data analysis.

I’d be the first to say that I’m a much nicer person now than when I was in graduate school. Yes, it took me a while to learn lesson one, I am embarrassed to say. However, I really did love statistics and if any of my fellow students had trouble, I was the first person they asked and I was really happy to help. When those students later became superintendents of schools or principal investigators of grants, they thought of me and became some of my earliest clients. Some of my professors also became clients, although those were after I’d had several years of experience.

Lesson number three: Don’t think you are smarter than your clients.

A young relative, who has a Ph.D. in math, asked me, “No offense, but isn’t what you do relatively easy, like anyone who understood statistics could do it? Why are you so in demand?”
Corollary to this lesson: If you find yourself saying, “No offense” just stop talking right then.

One reason a lot of would-be consultants go bankrupt or have to find another line of work is that they do think they are smarter than their clients. This manifests itself in a lot of ways, so we’ll return to it later, but one way is that they charge much more than the work is worth.

How do you know how much your work is worth?

Lesson number four: Ask yourself, if I had twice as many grants/contracts as I could do and I was paying someone to do this work, what would I be willing to pay?

That’s a good place to start.

I’ve met a lot of people over the years who charged much more than me and bragged to me about it. In the long run, though, I’m sure I made a lot more money. Clients talk. They find out that you are charging them three times as much as their friend down the block is getting charged by their consultant. You may think you’re getting away with it, but you won’t. You may get paid on those first few contracts but you’ll have a very hard time getting work in the future.

Lesson number five: Know multiple languages, multiple packages

I’ve had discussions with colleagues on whether it is better to be a generalist or a specialist.

I have had a few jobs where they just needed propensity score matching or just a repeated measures ANOVA but those have been the small minority over the past 30 years.

I would argue that even those who consider themselves specialists actually have a wide range of skills. Maybe they are only an expert in SAS but that includes data manipulation, macros, IML and most statistical procedures.

In my case, I would not claim to be the world’s greatest authority on anything, but if you need data entry forms created in JavaScript/HTML/CSS, a database back end with PHP and MySQL, and your data read into SAS, cleaned and analyzed with a logistic regression, I can do it all from end to end. No, I’m not equally good at all of those. It’s been so long since I used Python that I’d have to look everything up all over again.

I’ve used SPSS, STATA, JMP and Statistica, depending on what the client wanted. I think I might have even had a couple of clients using RapidMiner. For the last few years, though, the only packages I’ve used have been SAS and Excel. Why Excel? Because that’s what the clients were familiar with and wanted to use and it worked for their purposes. (See lesson three.)

I was really surprised to read Bob Muenchen saying SPSS surpassed R and SAS in scholarly publications. Almost no one I know uses SPSS any more, but, of course, my personal acquaintances are hardly a random sample. I suppose it depends on the field you are in.

I have never used R.

Some people think this is a political statement about being a renegade. Others think it’s because I’m too old to learn new things or in subservience to corporate overlords or some other interesting explanation. (The Invisible Developer, who has been reading over my shoulder, says he never got past C, much less D through Q.)

Since I fairly often get asked why not, I will tell you the real reasons. This is a complete digression, but this is my blog, so there.

  1. In my spare time that I don’t have, I teach Multivariate Statistics at a university that uses SAS. Since I’m using SAS in my class anyway and need real life data for examples, when a client has licenses for multiple packages and doesn’t care what I use (almost always the case), I use SAS.
  2. About the time that R was taking off, my company was also taking off in a different direction. The Invisible Developer and I own the majority of 7 Generation Games which is an application of a lot of the research done by The Julia Group. When we started developing math games, we needed to learn Unity, C#, PHP, SQL, JavaScript, HTML/CSS. We also needed to analyze the data to assess test reliability, efficacy, etc. I called the analysis piece and told The Invisible Developer I was interested in all of it so I’d do whatever was left. He was really interested in 3D game programming so he did the Unity/C# part. I did everything else. Then, after a few years, I moved to Chile, where the language I most had to improve was my Spanish.
Games in Spanish, English and Lakota

It worked out for me. We have a dozen games available from 7 Generation Games and now we’re coming out with a new line on decision-making.

I mention all this because I want to emphasize there isn’t a single path to succeeding as a consultant. There isn’t a specific language or package you have to learn. There is one thing you absolutely must have, though, and that’s the next post.

* (See Warner, S. L. (1989). Sigmund Freud and money. Journal of the American Academy of Psychoanalysis, 17(4), 609–622.)

Anyone who uses SAS (or doesn’t) probably has their own reasons. I have a few but a major one is the ease of importing just about any type of data.

Mo’ clients, Mo’ problems

There are multiple types of consultants. I’m the type who is, literally, all over the map. I’ve been in five countries this year and I think 11 states plus the District of Columbia, but I might have left off a couple. I said 9 in a post on a different blog where I occasionally write about my life and judo, but then I remembered I’d been in Texas for SAS Global Forum where I gave a talk on biostatistics and also in New Mexico speaking on transition from school to work for tribal youth with disabilities.

What that means is that I work with a wide range of organizations and their data is not all in the same format.

If you work with a wide range of clients, ease of data import matters

If you’re a consultant who works consistently with one client, data formats may not be your biggest issue. You probably wrote a program to read in that data, no matter what messy format it was in and you’re good to go. In my case, though, every dataset, every project is different.

All the data, all the time

In the previous post, I mentioned reading in the IPEDS data, which is a relatively small public data set (around 7,000 rows by 60 columns). Fantastically, that came with a SAS program, so all I needed to do was upload the raw data file and change the INFILE statement.

Proc import does not a consultant make

Maybe when you were a student you imported your data sets with a PROC IMPORT step. This isn’t terrible. You should use this procedure when you can. However, you’re going to need to go several steps further.
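For the record, here is about as far as PROC IMPORT will take you. This is just a sketch; the path and data set name are placeholders for your own file:

```sas
/* A minimal PROC IMPORT sketch -- the path and data set name are placeholders */
proc import datafile='/home/your_directory/survey.csv'
            out=work.survey
            dbms=csv
            replace;
    guessingrows=max; /* scan every row so variable types are not guessed from the first few */
run;
```

Leave GUESSINGROWS at its default and a column that happens to start with numbers can get read as numeric, mangling later character values. That is one reason you’ll eventually need a real DATA step with an INFILE statement instead.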

Even worse, if you’ve been getting your data by simply using the LIBNAME statement your professor provided you or doing some pointy-clicky thing with SAS Studio or Enterprise Guide (or SPSS) you have a lot to learn.

Every year, I have graduate students who tell me they are going to become consultants. More often than not, I shake my head and think,

“You have no idea what you are getting into.”

– Me

If you are going to be working as a statistical consultant for a variety of clients, far more of your time is going to be spent in the DATA step than in PROC LOGISTIC or PROC GLIMMIX.

It’s not just a matter of data formatting or missing data, but of creating the data you need that isn’t there. What do I mean by that? Ha ha, that is a future blog post that I may write next time I’m on a plane somewhere and have a spare moment. Probably tomorrow.

First of all, I want to draw your attention to this retraction in the Journal of the American Medical Association, and mad props to Drs. Aboumatar and Wise and Johns Hopkins for doing the right thing in publicly retracting it.

For the TL;DR crowd

Someone who is probably now unemployed miscoded the study groups in this randomized clinical trial of self-management of Chronic Obstructive Pulmonary Disease. What does that mean? In this case, it meant that the reported results were the exact opposite of what was really observed because the treatment groups were coded incorrectly. Also, read the seven tips at the end of this post.

When I talk about statistical analysis, I focus 80% or more of my time and attention on the basics of knowing your data, cleaning your data and examining your data some more. To some, mostly younger, statisticians, that is not the sexy stuff. Why am I not talking about neural nets or generalized linear mixed models? Don’t I know that improving your prediction by .3% can result in millions of dollars in profit for a corporation that has 38 million customers?

What I know is that problems like the one in that JAMA article occur more often than we like to admit.

Recently, a student sent thesis results and then the next day sent an email saying, “Oops, I meant to use the DESCENDING option in PROC LOGISTIC but I didn’t, so the results are the exact opposite of what I said.”

A couple of years ago, I did an analysis with a depression scale for which the standardized coding is 0 to 3, but the application had used 1 to 4. The first analysis showed that every single person in the sample was clinically depressed. Fortunately, I caught this before it was published. Even when I re-analyzed the data with the correct scoring the mean score was extremely high. This was not a random sample of the population, but rather, children with a family member addicted to methamphetamine. The original (incorrect) analysis wasn’t in the opposite direction but it did somewhat overstate the problem.
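The fix for a scale coded that way is a one-line shift inside an array loop. A sketch, assuming twenty items; the data set and item names (dep1-dep20) are made up:

```sas
/* Hypothetical data set and item names -- shifts a 1-4 coded scale back to the standard 0-3 */
data depression_fixed;
    set depression_raw;
    array items{20} dep1-dep20;
    do i = 1 to 20;
        items{i} = items{i} - 1;   /* subtract 1 from every response */
    end;
    drop i;
run;
```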

Several years before that, I worked for a client who had a previous consultant with no knowledge of their particular field but who was a very good programmer. In reviewing some of that person’s code to understand the data and how it had been scored, I found that NONE of the items that should have been reverse-coded had been. The consultant had simply taken the sum of all of the items. This research had been published, by the way. I mentioned this to the client and suggested that a retraction was in order. That retraction never happened and I never worked for that client again.

My Six Tips for Saving Your Ass

  • Learn to code. I don’t mean you need to be the greatest SAS/R/Python/whatever guru in the world, but you should be able to read through the code someone else wrote and understand it. This means you should be able to read an IF-THEN statement, a loop re-coding all the items in an array and the statistical procedures used in your analysis.
  • Understand that the DESCENDING option in PROC LOGISTIC reverses the probability that is modeled. By default, PROC LOGISTIC models the probability of the response level with the lower ordered value, so if you have death (coded 0 = lived, 1 = died) as the dependent variable, the procedure is predicting who lived. If you use the DESCENDING option, it’s going to predict who died.
  • Know how many people should be in each group; control, experimental condition 1, experimental condition 2. Do a PROC FREQ and see if it matches what you expect.
  • Know the range for each item in your analysis and do a PROC MEANS with mean, minimum, maximum and standard deviation. Even if you have 500 or 600 variables it shouldn’t take you all that long to scan through that many lines and see if anything is out of range.
  • Know which items should have been reverse-coded and check if that was done.
  • Compute reliabilities for each scale in an analysis. While the reliability would not have changed in the depression example where 1 was added to every response, it would have picked up the cases where the variables were not re-coded, by showing very low reliabilities.
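Most of these tips boil down to a few lines of SAS you can run before you believe any result. A sketch; the data set, group and item names (study, group, item1-item20, died, age, treatment) are all hypothetical:

```sas
/* All names here are hypothetical stand-ins for your own study */
proc freq data=study;
    tables group;                /* do the group counts match the design? */
run;

proc means data=study n mean min max std;
    var item1-item20;            /* out-of-range minimums and maximums jump out here */
run;

proc corr data=study alpha nocorr;
    var item1-item20;            /* Cronbach's alpha; a very low value suggests items were not reverse-coded */
run;

proc logistic data=study descending;
    model died = age treatment;  /* DESCENDING: model the probability that died = 1 */
run;
```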

A seventh, extra bonus tip

If you can’t understand the code that someone has written, not because you are a moron (can’t help you there), but because they are one of those people who never write comments in code, don’t believe in documentation and write code that includes an unnecessary number of macro variables, user-written macros and overly complicated solutions, fire their sorry ass and hire someone less pompous. I’m not saying you shouldn’t have macros or that because a person uses a DATA step and you prefer PROC SQL you should get rid of them. What I am saying is if you ask a person what decisions they made in writing that code and what was the reason for, say, using a generalized linear model instead of a general linear model, they should be able to tell you.

The famous statistician, F.N. (for Florence Nightingale) David was a professor at UC Riverside, where I earned my doctorate. My advisor told this story about her:

We were on this dissertation committee – I forget if it was for biology or what. Back then, this was a small campus, so if you were in statistics you could end up on any committee. So, the candidate gets to the end of his defense, and F.N. David pulls the cigar out of her mouth and says,

“Young man, you believe your numbers far too much.”

The point Dr. Eyman was trying to make to me was that even if you have done every single computation perfectly …

“The government are very keen on amassing statistics. They collect them, add them, raise them to the nth power, take the cube root and prepare wonderful diagrams. But you must never forget that every one of these figures comes in the first instance from the village watchman, who just puts down what he damn pleases.”

– Josiah Stamp

What is a conscientious statistical consultant to do?

Start with getting to know your data better than God knows the Bible. Let’s start with analyzing secondary data, for example, IPEDS, that has already been collected. I’ll talk about collecting your own data later. Let me just put in a plug for doing it electronically if possible. Also, make sure your data entry staff know which is the intervention and which is the control group. (You think I’m kidding but I’m not.)

Secondary data analysis: Read the documentation!

You think that is obvious, do you? IPEDS is the Integrated Postsecondary Education Data System, collected by the National Center for Education Statistics. It is my favorite type of data set and the type you almost never get. It includes pretty much the entire population of interest.

If you don’t know these things, you don’t know your data:

  • Is it a sample or the entire population?
  • If it’s a sample, what proportion of the population was sampled and how? Randomly? Stratified random?
  • Does the data set have sampling weights? What is the variable name for those weights? (You’re going to use them, aren’t you? Please say yes.)
  • How were the data recorded?

This isn’t all you need to know. We’ll talk about specific variables next.

One reason I like IPEDS is that you can be pretty sure everyone reported data, because reporting is mandatory for any institution that gets federal financial aid. It also includes the U.S. service academies, which are about the only post-secondary institutions that don’t. It also gives you a SAS program for reading the data after you upload it. There are also SPSS and STATA programs.

Another thing I like about IPEDS is that it is, inside and out, one of the best documented data sets I’ve seen. I’d recommend it as an example of how to do things if you are going to be creating data sets for secondary analysis yourself. Don’t get used to it, though, because most of what you’ll find in your career is far worse than this. Here is just a simple example from one data set.

*** Created:    October 2, 2018                              ***;
*** Modify the path below to point to your data file.        ***;
***                                                          ***;
*** The specified subdirectory was not created on            ***;
*** your computer. You will need to do this.                 ***;

If you want to analyze it using SAS Studio, now you know that once you’ve uploaded the data, you do need to change the INFILE statement. If you don’t know the full path, ctrl-click (Mac) or right-click (Windows) on the data file and select PROPERTIES.

Select Properties to get the path to your file

Change the INFILE statement to match the path you see there, so now it looks like this:

infile '/home/your_directory/IPEDS/hd2017.csv' delimiter=',' DSD MISSOVER firstobs=2 lrecl=32736;

You won’t necessarily have the delimiter, etc. It depends on your file. Okay, run it, you have data. Awesome!
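Before celebrating, though, take one quick look at what you actually read in. Assuming the IPEDS program created a data set called hd2017 in WORK (check the DATA statement in the program they give you), something like:

```sas
proc contents data=work.hd2017;        /* variable names, types and the observation count */
run;

proc print data=work.hd2017 (obs=10);  /* eyeball the first ten records against the raw file */
run;
```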

When I run frequencies on the IPEDS data, I get 7,153 institutions, but the IPEDS methodology report says there are 6,642. What the hell? Looking through the data, I find that 287 institutions closed in 2017 or earlier. Another 38 were combined with another institution or marked “out of scope” for some unspecified reason. There were 41 that were “not primarily post-secondary institutions.” Since I’m only interested in individual, active institutions for the research I’m doing, I dropped all of those.

There were 88 institutions that were new in 2017 or had their Title IV (financial aid) eligibility restored. After debating back and forth, I decided to drop those, too. My interest is in developing a baseline of enrollment and retention, which these new institutions will only have for one year.
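The dropping itself is an ordinary DATA step with a few subsetting statements. The variable names and values below are hypothetical stand-ins; the actual status flags and their codes are in the IPEDS dictionary file:

```sas
/* close_year and status are hypothetical -- look up the real IPEDS flag names and codes */
data active_institutions;
    set work.hd2017;
    if close_year ne . then delete;   /* closed in 2017 or earlier */
    if status in ('COMBINED', 'OUT OF SCOPE', 'NOT PRIMARILY POSTSECONDARY') then delete;
run;
```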

My point is that I’ve gotten one of the best data sets you could ever find and 7% of the data is inappropriate for my purpose. Does it matter as long as 93% of the data are correct? Well, I definitely think that my results would be less accurate.

My second point is that there is not anything “wrong” with the IPEDS data. I can imagine plenty of circumstances in which one would want to have the data on closed institutions.

These may seem like details, but I am pretty convinced that if you are not a “detail person” you are never going to make it in the long run as a statistical consultant. These details add up fast.

One last thought – if 7% of the data needed to be tossed out before we even got started, and this is an extremely well-funded, well-designed data set, what do you think the average secondary analysis is going to be like?