# Don’t believe your numbers too much: IPEDS example

The famous statistician, F.N. (for Florence Nightingale) David was a professor at UC Riverside, where I earned my doctorate. My advisor told this story about her:

We were on this dissertation committee – I forget if it was for biology or what, back then, this was a small campus so if you were in statistics you could end up on any committee. So, he gets to the end of his defense, and F.N. David pulls the cigar out of her mouth and says,

#### “Young man, you believe your numbers far too much.”

The point Dr. Eyman was trying to make to me was that even if you have done every single computation perfectly …

“The government are very keen on amassing statistics. They collect them, add them, raise them to the nth power, take the cube root and prepare wonderful diagrams. But you must never forget that every one of these figures comes in the first instance from the village watchman, who just puts down what he damn pleases.”

– Josiah Stamp

## What is a conscientious statistical consultant to do?

Start with getting to know your data better than God knows the Bible. Let’s start with analyzing secondary data, for example, IPEDS, that has already been collected. I’ll talk about collecting your own data later. Let me just put in a plug for doing it electronically if possible. Also, make sure your data entry staff know which is the intervention and which is the control group. (You think I’m kidding but I’m not.)

### Secondary data analysis: Read the documentation!

You think that is obvious, do you? IPEDS is the Integrated Postsecondary Education Data System, collected by the National Center for Education Statistics. It is my favorite type of data set and the type you almost never get. It includes pretty much the entire population of interest.

#### If you don’t know these things, you don’t know your data:

• Is it a sample or the entire population?
• If it’s a sample, what proportion of the population was sampled and how? Randomly? Stratified random?
• Does the data set have sampling weights? What is the variable name for those weights (You’re going to use them, aren’t you? Please say yes.)
• How were the data recorded?

This isn’t all you need to know. We’ll talk about specific variables next.

One reason I like IPEDS is that you can be pretty sure everyone reported data because it’s mandatory for any institution who gets federal financial aid. It also includes the U.S. service academies, which are about the only post-secondary institutions who don’t. It also gives you a SAS program for reading the data after you upload it. There are also SPSS and STATA programs.

Another thing I liked about IPEDS is it is, inside and out, one of the best documented data sets I’ve seen. I’d recommend it as an example of how to do things if you are going to be creating data sets for secondary analysis yourself. Don’t get used to it, though, because most of what you’ll find in your career is far worse than this. Here is just a simple example from one data set.

```*** Created:    October 2, 2018                                ***;
*** Modify the path below to point to your data file.        ***;
***                                                          ***;
*** The specified subdirectory was not created on            ***;
*** your computer. You will need to do this.                 ***;```

If you want to analyze it using SAS Studio, now you know that once you’ve uploaded the data, you do need to change the INFILE statement. If you don’t know the full path, ctrl-click (Mac) or right-click (Windows) on the data file and select PROPERTIES

Change the INFILE statement to what you see in the path, so now it looks like this

`infile '/home/your_directory/IPEDS/hd2017.csv' delimiter=',' DSD MISSOVER firstobs=2 lrecl=32736;`

You won’t necessarily have the delimiter, etc. It depends on your file. Okay, run it, you have data. Awesome!

When I run frequencies for the IPEDS data, I get 7,153 institutions but the IPEDS methodology report says there are 6,642. What the hell? Looking through the data, I find that 287 institutions were closed in either 2017 or prior. Another 38 were combined with another institution or not to be include for some unspecified reason “out of scope”. There were 41 that were “not primarily post-secondary institutions”, so I dropped those also. Since I’m only interested in individual, active institutions for the research I’m doing, I’m dropping those.

There were 88 institutions that were new in 2017 or had their Title IV (financial aid) eligibility restored. After debating back and forth, I decided to drop those, too. My interest is in developing a baseline of enrollment and retention, which these new institutions will only have for one year.

My point is that I’ve gotten one of the best data sets you could ever find and 7% of the data is inappropriate for my purpose. Does it matter as long as 93% of the data are correct? Well, I definitely think that my results would be less accurate.

My second point is that there is not anything “wrong” with the IPEDS data. I can imagine plenty of circumstances in which one would want to have the data on closed institutions.

These may seem like details, but I am pretty convinced that if you are not a “detail person” you are never going to make it in the long run as a statistical consultants. These details add up fast.

One last thought – if 7% of the data needed to be tossed out before we even got started, and this is an extremely well-funded, well-designed data set, what do you think the average secondary analysis is going to be like?

# 30 Things I Learned in 30 Years as a Statistical Consultant – Part 1 of lots

Never fear, I’m not going to post all 30 things in this post. This is a series. A LONG series. Get excited.

I was invited to speak at SAS Global Forum next year and it occurred to me after thinking about it for 14.2 seconds that there are plenty of people at SAS and elsewhere that are more likely to have new statistics named after them than me.

While I can code mixed models, path analysis and factor analysis without much trouble, I’d be the first to admit that there are plenty of new procedures and ideas I see every year that I never really master. I mean to, I really do, but then I get back to the office and attacked by work. So, the person to introduce you to every facet of the bleeding edge, nope, that’s probably not me, either.

#### If you think this is where I experience impostor syndrome and say “I couldn’t possibly have anything worth saying”, we have obviously never met.

Okay, there’s the most current picture of me, so now you sort of know who I am. I figured I better post a current one because I had not updated my LinkedIn photo in so long that I connected with someone who said,

“Oh, I have met your mom.”

“No, you have met me. My mom is 86 years old and retired to Florida, as federal law requires. Florida state motto: Your grandparents live here.”

### So, when do you get to these 30 things?

Now. I decided to divide everything I learned into four categories.

1. Getting clients
2. Getting data into shape
4. Getting people to understand you.

I picked four because if I had five or six categories, people would expect there to be an even number of points in each because 30 divides evenly by five and six. See? I am good at math.

## The money part: Getting clients

### First, decide what kind of statistical consultant that you want to be.

#### Are you a specialist or a generalist?

You can be like my friend, Kim Lebouton, who specializes in SAS administration for the automotive industry and seems intent on keeping with the same clients until she or they die, whichever comes first. I linked to her twitter because she is too cool to have a web page.

You could be like Jon Peltier of Peltier Tech and specialize in Excel. Basically, if there is anything Jon doesn’t know about Excel, it’s not worth knowing. Personally, I feel as if most things about Excel are not worth knowing, which is why I’m not that kind of consultant.

I do love that the Microsoft Store carries our games for Windows, though, so woohoo for Microsoft.

#### I’m the kind of statistician that doesn’t have a time zone.

A few years ago, I was at a conference when people were trying to coordinate their schedule for an online meeting. They were saying what time zone they were in and someone asked me,

“You’re on Pacific Time, right?”

My friend interrupted and said,

“She doesn’t have a time zone.”

It’s true. I was on Central Time last week, in North Dakota. I’m in California this week. Next week, I’m back on Central Time in Minnesota and South Dakota. The following week, I’m on Eastern Time in Boston.

In the winter here (which was summer there), I was in Chile. During the spring here (which was fall there), I was in Australia, and I’m in the U.S. now.

## BUT HOW DO YOU FIND CLIENTS?

This is probably the question I get the most and I have an odd answer.

#### Get really good at something and the clients will find you.

Jon’s really good with Excel. Kim is superb at SAS administration. What am I good at? I’d say I am excellent at taking something that a client may only be vaguely aware is a statistical problem and solving it from beginning to end, in a way that makes sense to them.

If you try mansplaining me in the comments that what I do is called applied statistics, I will find where you live and slap you upside the head. I teach at National University in the Department of Applied Engineering. It’s in the fucking department name. I KNOW.

In response to the question in stats.stackexchange regarding the difference between mathematical statistics and applied statistics, there was this answer:

Mathematical statistics is concerned about statistical problems, while applied statistics about using statistics for solving other problems.

– Random person I don’t know on the Internet

Mathematical statistics often involves simulated (that is, fake) data, and nearly always uses data that is cleaned of data entry errors – in other words, not very representative of real life.

If you ask me, and even if you don’t , many data scientists act as if data issues can be fixed by having big enough data. This always seems to me similar to those startups who are losing money on every sale but aren’t worried because they are going to make it up on volume. Since data is key, let’s talk about that in the next post.

### But wait! How do you get those first clients?

There is never a surplus of excellence – unless maybe you are an English professor, but they’re not reading this blog.

#### Network.

Let your professors know that you are interested in consulting. I got my first consulting contracts by referrals from professors who had more work than they could do. Similarly, I have referred several potential clients to students and junior professionals either because I was too busy, not interested or they could not afford my rates.

#### Go to conferences

I’ve had clients referred by other consultants who met me at a conference and a particular contract was not in their area of expertise but they thought it might be in mine. Similarly, I’ve referred clients to other people because I don’t really do that thing but maybe this person will be available.

#### Most jobs come by word of mouth

There is an evaluation consultant organization. I don’t know who the hell belongs to it. Much of the work that I do, someone’s job is on the line. That is, if they can’t demonstrate results, they may lose their funding and everyone in the building loses their job. In almost all of it, at some point the project director or manager or whoever is going to go present these results to a federal agency, tribal council or upper management, trusting that everything they say is true because I said so.

In that type of high stakes situation, they’re not going to get someone from an ad on Craig’s list. If that sounds like bad news, the good news is that after you have been around for a while and done good work, the jobs come to you.

Since a big difference between mathematical statisticians and applied statisticians is the messiness of the data, I’m going to address that in the next few posts. Expect more swearing. Because data.