### May

#### 14

Someone had a question about factor analysis with Mplus and even though it is not a piece of software I work with normally, we aim to please at The Julia Group, so I downloaded the demo version and away I went.

It truly was, as my granddaughter says, easy-peasy lemon squeezie.

You might not think so, because the first thing you are confronted with is pretty much a blank window like this

For people who are used to Excel, SPSS, SAS Enterprise Guide or other friendly GUI interfaces, this might be a bit off-putting. However, doing a confirmatory factor analysis was this easy.

1. Create a .dat file from the original file. The file was in SAS format and I did not have SAS on the laptop I was working on (I’m in Cambridge, MA at the moment). What I did was:

- Opened the file in SPSS: from the FILE menu, I selected READ TEXT DATA and then chose SAS as the format.
- Ran an SPSS command from the syntax window to write out a tab-delimited file with no header, which is the type of input Mplus expects.

2. Type in this program to do a two-factor solution with the first three variables loading on the first factor and the next three loading on the second factor.

TITLE: Confirmatory Factor Analysis ;

DATA: FILE IS "/Users/annmaria/Documents/mplustest/values.dat" ;

VARIABLE: NAMES ARE q1f1 q2f1 q3f1 q1f2 q2f2 q3f2 ;

MODEL: f1 BY q1f1 q2f1 q3f1 ;

f2 BY q1f2 q2f2 q3f2 ;

OUTPUT: standardized ;

3. Click the RUN button.

That is really all there was to it.

Okay, well that is easy if you knew what to type so let me explain a few things. If you know SAS or SPSS this will be easy.

Each of those things that I put in all capitals is a command in Mplus, analogous to a DATA or PROC step in SAS and a command in SPSS. They don’t need to be in all caps, I just did that for ease for the reader. They DO need to be followed by a colon and then end the statement in a semi-colon.

Title – pretty obvious, gives your output a title.

DATA: FILE IS – gives the path to locate your data. If your file is in the same directory as your program, you don’t need a fully qualified path and can just call it ‘values.dat’.

VARIABLE: NAMES ARE

Give the names of your variables. You can specify a format, but if you do not, Mplus assumes they are in free format, which is the same as what SAS refers to as list format. You might want to note that if you are using the demo version you can only have a maximum of 6 dependent and 2 independent variables. (In a factor model like this one, the six observed items count as the dependent variables.)
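If you have neither SAS nor SPSS handy, a free-format file is simple enough to produce with a few lines of code. Here is a minimal Python sketch, assuming you already have the data as a CSV with a header row. The function name, the file names and the -999 missing-value code are all my own placeholders, not anything Mplus requires.

```python
import csv

def csv_to_mplus_dat(src_path, dst_path, missing="-999"):
    """Copy a CSV that has a header row to a tab-delimited file with no header.

    Returns the variable names from the header so you can paste them
    into the NAMES ARE list. The missing-value code is an assumption.
    """
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        header = next(reader)  # drop the names; Mplus gets them from NAMES ARE
        for row in reader:
            # An empty field could shift columns in free format,
            # so write an explicit missing-value code instead.
            writer.writerow([cell.strip() if cell.strip() else missing for cell in row])
    return header
```

If you do write a missing-value code like this, you would also have to declare it in the VARIABLE command (I believe the option is MISSING ARE ALL (-999), but check the Mplus manual before relying on that).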

MODEL: This is my model (duh) and I am modeling two factors. The first factor I creatively named f1 and it is represented BY (notice the BY in the command) variables also creatively named q1f1 q2f1 and q3f1.

Similarly, I have a second factor named f2, which is represented BY q1f2, q2f2 and q3f2.
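Written out as equations, the two BY statements describe something like the following. This is the standard confirmatory factor model, not anything copied from Mplus output. The λ (loading) and ε (residual) symbols are my notation, and I have left out the item intercepts for brevity.

```latex
\begin{aligned}
\mathrm{q1f1} &= \lambda_{1} f_{1} + \varepsilon_{1}, &
\mathrm{q1f2} &= \lambda_{4} f_{2} + \varepsilon_{4}, \\
\mathrm{q2f1} &= \lambda_{2} f_{1} + \varepsilon_{2}, &
\mathrm{q2f2} &= \lambda_{5} f_{2} + \varepsilon_{5}, \\
\mathrm{q3f1} &= \lambda_{3} f_{1} + \varepsilon_{3}, &
\mathrm{q3f2} &= \lambda_{6} f_{2} + \varepsilon_{6}.
\end{aligned}
```

Each item loads on only one factor, and its loading on the other factor is fixed at zero. That is what makes the analysis confirmatory. As far as I know, Mplus by default fixes the first loading of each factor at 1 to set the scale and lets f1 and f2 correlate.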

I added an OUTPUT statement with a standardized option because I wanted (surprise) standardized estimates. That statement is not required but as you’ll see in my next post on interpreting factor analysis data, you do want it.

I am intrigued by Mplus. It sort of assumes you have close to perfectly cleaned up data because I wouldn’t want to be doing a lot of data management with it, but for doing some relatively complex models – factor analysis, path analysis, structural equation modeling – it looks pretty cool.

### Aug

#### 1

# Confirmatory Factor Analysis with AMOS: OMG it’s THIS button

August 1, 2011 | 10 Comments

I’ve forgotten more about statistical software than you’ll ever know!

I don’t know why people ever say this in a bragging tone because I consider that to be my problem. I’ve forgotten it. Today, I needed to do a confirmatory factor analysis with someone using AMOS. They wanted it in AMOS so that is what I did. All the parameter estimates came out correctly, model fit indices good. The only problem is that I knew that they would want the estimates on the diagram and I could not remember how to do it. Yes, they could have copied and pasted it into Word or Graphic Converter or any of a number of other packages and then typed the numbers into text boxes, but people pay me so they can do less work, not more. Besides, I KNEW there was an option or something. Here it is:

When you are at this point, with your lovely model well-specified and all of the output showing up nicely when you select TEXT OUTPUT under VIEW, click the second button at the top of the middle pane. That will put your path estimates on your diagram.

Three other things to remember

- For students, there is a FREE version of AMOS 5. It works great. I was amazed that something free would work so well. I thought when I saw there was a free version for sure it would be a scam, but no, I downloaded it from AMOS Development Corporation, ran it on a Windows 7 machine and it works great. I believe it only runs on Windows.
- The AMOS manual by James Arbuckle is incredibly well-written. I’m using AMOS with SPSS 18 but I noticed he wrote the 19 manual also and it’s pretty much the same. Don’t buy the manual, you can download the pdf free from lots of university sites.
- The Indiana University Stat/Math center has a really straightforward discussion of CFA using AMOS.

Now you know the secret of why I write this blog. Because every time I forget something I remember, “I wrote a post about that once”. And the nice thing about the Internet is even if I’m in Fort Totten, ND or Lac du Flambeau, WI or Tunis, Tunisia, I can look it up and find it.

Unlike my coffee cup. Now, where did I put that?

### Apr

#### 26

I have to choose between either SAS or SPSS for a new course in multivariate statistics. You can take it up with the university if you like, but these are my only two options, in part because the course is starting soon.

I need to decide in a few days which way to go. Here are my very idiosyncratic reasons for one versus the other:

SPSS

- There is a really good textbook on multivariate statistics that I think would be perfect for these students and it uses SPSS. The book is Advanced and Multivariate Statistics by Mertler & Vannatta, in case you were wondering.
- SPSS can be installed pretty easily on the desktop and these are pretty non-technical students, so that’s a plus.
- The point and click interface for SPSS is pretty easy and similar to Excel which most people have used.
- Personally, I haven’t used SPSS in a while so it would be nice to use something different.

SAS

- Students can just register and go to the website to use SAS Studio
- Structural equation modeling and other advanced statistics procedures are built in rather than being an add-on
- SAS Studio is free vs $80 or so for students and $260 for professor (i.e., me) to buy SPSS academic versions including add-ons needed
- I’m more familiar with SAS and find it easier to code than SPSS syntax.

I’ve toyed with the idea of showing both options but that uses up class time better spent on teaching, for example, how do you interpret a factor loading or AIC.

My big objection to SAS is that I can’t find a recent textbook that is good for a multivariate analysis course in a social sciences department. The best one is by Cody and that is from 2005. I also use a couple of chapters from the Hosmer & Lemeshow book on Applied Logistic Regression, but I need something that covers factor analysis, repeated measures ANOVA and hopefully, MANOVA and discriminant function analysis, too.

I think most of these students have careers in non-profits and they are not going to be creating new APIs to analyze tweets or anything using enormous databases, so the ability to analyze terabytes is moot. This will probably be their second course in statistics and maybe their first introduction to statistical software.

Suggestions are more than welcome.

For random advice from me and my lovely children, subscribe to our youtube channel 7GenGames TV

P. S. You can skip the hateful comments on why SAS and SPSS both suck and I should be using R, Python or whatever your favorite thing is. Universities don’t usually give carte blanche. These are my two choices.

P.P.S. You can also skip the snarky comments on how doctoral students should have a lot more statistics courses, all take at least a year of Calculus, etc. Even if I might agree with you, they don’t and I need tools that work for the students in my classes, not some hypothetical ideal student.

### Oct

#### 23

# Parallel Analysis Criterion Simplified?

October 23, 2014 | 4 Comments

Am I missing something here? All of the macros I have seen for the parallel analysis criterion for factor analysis look pretty complicated, but, unless I am missing something, it is a simple deal.

The presumption is this:

There isn’t a number like a t-value or F-value to use to test if an eigenvalue is significant. However, it makes sense that the eigenvalue should be larger than if you factor analyzed a set of random data.

Random data is, well, random, so it’s possible you might have gotten a really large or really small eigenvalue the one time you analyzed the random data. So, what you want to do is analyze a set of random data with the same number of variables and the same number of observations a whole bunch of times.

Horn, back in 1965, was proposing that the eigenvalue should be higher than the average of when you analyzed a set of random data. Now, people are suggesting it should be higher than 95% of the time you analyzed random data (which kind of makes sense to me).

Either way, it seems simple. Here is what I did and it seems right so I am not clear why other macros I see are much more complicated. Please chime in if you see what I’m missing.

- Generate a set of random data with N variables and Y observations.
- Factor analyze it and keep the eigenvalues.
- Repeat 500 times.
- Combine the 500 datasets (each will have only 1 record with N variables).
- Find the 95th percentile of each eigenvalue.

* numvars = number of variables, numreps = number of observations in each random data set ;

%macro para(numvars,numreps) ;

%DO k = 1 %TO 500 ;

data A;

array nums {&numvars} a1- a&numvars ;

do i = 1 to &numreps;

do j = 1 to &numvars ;

nums{j} = rand("Normal") ;

if j < 2 then nums{j} = round(100*nums{j}) ;

else nums{j} = round(nums{j}) ;

end ;

drop i j ;

output;

end;

proc factor data= a outstat = a&k noprint;

var a1 - a&numvars ;

data a&k ;

set a&k ;

if trim(_type_) = "EIGENVAL" ;

%END ;

%mend ;

%para(30,1000) ;

data all ;

set a1-a500 ;

proc univariate data= all noprint ;

var a1 - a30 ;

output out = eigvals pctlpts = 95 pctlpre = pa1 - pa30;

*** You do not need the transpose but I just find it easier to read ;

proc transpose data= eigvals out=eigsig ;

Title "95th Percentile of Eigenvalues" ;

proc print data = eigsig ;

run ;

It runs fine and I have puzzled and puzzled over why a more complicated program would be necessary. I ran it 500 times with 1,000 observations and 30 variables and it took less than a minute on a remote desktop with 4GB RAM. Yes, I do see the possibility that if you had a much larger data set that you would want to optimize the speed in some way. Other than that, though, I can’t see why it needs to be any more complicated than this.

If you wanted to change the percentile, say, to 50, you would just change the 95 above. If you wanted to change the method from, say, Principal Components Analysis (the default, with prior communality estimates of 1) to something else, you could just do that in the PROC FACTOR step above.

The above assumes a normal distribution of your variables, but if that was not the case, you could change that in the RAND function above.
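For anyone without SAS who wants to check the logic, here is a rough Python sketch of the same idea using only the standard library. It is not a translation of the macro above. It skips the rounding step, computes eigenvalues of the correlation matrix with a plain Jacobi rotation routine, and uses a simple nearest-rank percentile, which may differ slightly from what PROC UNIVARIATE reports. All of the function names are mine.

```python
import math
import random

def correlation_matrix(data):
    """Pearson correlations between the columns of a list-of-rows data set."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    norms = [math.sqrt(sum(r[j] ** 2 for r in centered)) for j in range(p)]
    return [[sum(r[i] * r[j] for r in centered) / (norms[i] * norms[j])
             for j in range(p)] for i in range(p)]

def eigenvalues(sym, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations, largest first."""
    a = [row[:] for row in sym]
    p = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(p) for j in range(i + 1, p))
        if off < tol:  # off-diagonal part is negligible, so we are done
            break
        for i in range(p - 1):
            for j in range(i + 1, p):
                if abs(a[i][j]) < 1e-15:
                    continue
                # Angle that zeroes a[i][j] after the similarity transform
                theta = 0.5 * math.atan2(2 * a[i][j], a[i][i] - a[j][j])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(p):  # rotate columns i and j
                    aki, akj = a[k][i], a[k][j]
                    a[k][i] = c * aki + s * akj
                    a[k][j] = -s * aki + c * akj
                for k in range(p):  # rotate rows i and j
                    aik, ajk = a[i][k], a[j][k]
                    a[i][k] = c * aik + s * ajk
                    a[j][k] = -s * aik + c * ajk
    return sorted((a[i][i] for i in range(p)), reverse=True)

def parallel_analysis(n_vars, n_obs, n_reps=500, percentile=95, seed=None):
    """Percentile cutoff for each eigenvalue across n_reps random normal data sets."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_reps):
        data = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_obs)]
        runs.append(eigenvalues(correlation_matrix(data)))
    rank = max(1, math.ceil(percentile * n_reps / 100))  # nearest-rank percentile
    return [sorted(run[k] for run in runs)[rank - 1] for k in range(n_vars)]

if __name__ == "__main__":
    # Small demo; 30 variables by 1,000 observations would be slow in pure Python.
    for k, cutoff in enumerate(parallel_analysis(6, 100, n_reps=50, seed=2014), start=1):
        print(f"eigenvalue {k}: keep the factor if your eigenvalue exceeds {cutoff:.3f}")
```

Pure Python is far slower than PROC FACTOR for this, so for the 30-variable, 1,000-observation case you would want a numerical library. The point of the sketch is only that the logic really is this short.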

As I said, I am puzzled. Suggestions to my puzzlement welcome.

### Jul

#### 15

# The Facts of Factor Patterns

July 15, 2013 | Leave a Comment

About a week ago, I went through pointing and clicking your way to a factor analysis. At the time, I suggested rotating the factors. Now we’re going to interpret the rotated factor pattern. Let me recap, briefly. Agresti and Finlay (p.532) put it way better than me when they said:

Factor analysis is a multivariate statistical technique used for …

1. Revealing patterns of interrelationships among variables

2. Detecting clusters of variables, each of which contains variables that are strongly intercorrelated …

3. Reducing a large number of variables to a smaller number of statistically uncorrelated variables, the factors of factor analysis.

All of which is well and good but once you have your factors, what do they mean? How do you interpret them?

Important point one: The correlation of a variable with a factor is called the loading.

Important point two: To ease interpretation we’d really like to have “simple structure”, that is, where variables load close to 1.0 on one factor and close to zero on the others. I mean, really, if you think about it, if your items load equally on all factors it’s going to be pretty hard to interpret.

Let’s take a look at my example from the 500 Family Study, which you have probably forgotten already. To make it easier to interpret, I copied the factor pattern output into a spreadsheet and sorted by the loadings on the first, second and third factor. You can see that almost all of the items relating to discussion loaded on the first factor. So, I could say that factor 1 is “Communication with parents”. The second factor seems to be mostly about rules, punishment and placing limits, such as punishments or reward for grades, curfew and time out with friends. The discussion questions that load more on this factor than the first are on discussion of breaking rules and discussion of curfew. The third factor is all of the items related to decision-making, with the exception of family purchases, which didn’t really load on any of the three factors.

Notice a few things:

- Just like correlations, loadings can be positive or negative. How late your curfew is loads negatively on the Rules Factor. That is, families that have stricter rules have an earlier curfew. How often parents limit time out with your friends loads positively on the Rules Factor.
- Although it’s not ideal, variables can load on more than one factor. As noted, the discussion of breaking rules item loads both on the Communication Factor and the Rules Factor.
- Variables can also fail to load on any factor at all, like the decision on family purchases. My guess is that most parents decide most purchases without consulting their adolescent children.
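If you would rather not sort a spreadsheet by hand, the same bookkeeping takes a few lines of code. This is only a sketch. The loadings below are made up for illustration (they are NOT the 500 Family Study results), the item names are hypothetical, and the 0.40 cutoff is a common rule of thumb, not anything the factor analysis itself tells you.

```python
def assign_items(loadings, cutoff=0.40):
    """For each item, list the factors (numbered from 1) where |loading| >= cutoff."""
    return {
        item: [k for k, value in enumerate(row, start=1) if abs(value) >= cutoff]
        for item, row in loadings.items()
    }

# Made-up rotated loadings on three factors, for illustration only.
example = {
    "discuss_school":     [0.72, 0.10, 0.05],   # loads on factor 1 only
    "discuss_rule_break": [0.45, 0.52, 0.08],   # loads on factors 1 and 2
    "curfew_how_late":    [0.03, -0.61, 0.12],  # negative loading on factor 2
    "decide_purchases":   [0.11, 0.09, 0.21],   # loads on nothing
}

for item, factors in assign_items(example).items():
    print(item, factors if factors else "does not load on any factor")
```

The absolute value matters: a loading of -0.61 is just as strong as +0.61, it only runs in the opposite direction.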

The really useful result of factor analysis is that it allows you to take your 42 items, discard one as not really fitting and distill the others down into three factors. Instead of using 41 individual items to predict your outcome of interest, say delinquent behavior, you can use three. It’s almost certain that those three factors will be far more reliable than any individual item, and your results will be far easier to explain as well, say, “Students who have more communication with their parents, moderate rules and moderate input on decision-making have the lowest rate of delinquent behavior and highest academic achievement.”

Not sure if that is true or not but with these factors we are now in a good position to test that. I just need a couple more measures, of delinquent behavior and academic achievement, and I can test my hypotheses. I expect there will be a linear relationship with communication (negative for delinquency and positive for academics) and a curvilinear relationship with the other two measures (inverse for delinquency).

I guess that will be my next thing to do when I have some spare time. Or, you can wander on over to ICPSR.org and download the 500 Family Study data yourself.

### Jul

#### 3

# Factor analyzing a correlation matrix is SO easy (I am not making this up)

July 3, 2013 | 2 Comments

When I first went to graduate school in the 1970s (yes, for my MBA, in 1978), if one were to comment casually,

“And then I factor-analyzed the correlation matrix to solve that problem.”

Everyone would say,

“Ooooh.”

Back in those days before statistical calculators (forget the Internet), when computers only came in big (really big) blue boxes and SAS probably was still compiled on punched cards and your department had to pay for computer time, you did it **by hand** if you were a broke graduate student. Yes, I mean literally with a pencil in your hand. First, you computed the correlations between each of the items and then you applied some equations you can find in any detailed book on factor analysis, which I had forgotten and then looked up again in the documentation here. It doesn’t matter anyway because no one does it that way any more.

Since I had SAS Enterprise Guide open, I just pointed and clicked my way through. First I clicked on the data set I had opened and then went to the TASKS menu, pulled down to MULTIVARIATE and then CORRELATIONS.

Next, I just clicked on the variables I wanted to use in my analysis and clicked the blue arrow in between panes of the window to select them as my analysis variables. You can also shift-click to select a bunch at once. Don’t click RUN yet!

You need to output the correlation matrix as a SAS dataset of TYPE=CORR. Fortunately, that is super-duper easy. You just click in the far left pane on the option that says OUTPUT DATA. Then click in the box next to SAVE OUTPUT DATA. Now you can click RUN.

Now that you have a correlation matrix, you can go ahead and factor analyze it just like you did before. Click here if you don’t remember.

Note that

“The data set created by the CORR procedure is automatically given the TYPE=CORR data set option, so you do not have to specify TYPE=CORR.”

I’m pretty sure this was not always the case, but according to the SAS documentation, it is now. So, now that you have your input data set, you just click on it, go to TASKS > MULTIVARIATE > FACTOR ANALYSIS like before and there you have it.

But …. what exactly do you have? Check back here for my next post titled (I am not making this up), “What’s all that factor analysis crap mean, anyway?”

=================================================

Learn math. Save lives. Learn culture. Kill animals. (Relax, it’s a game.)

Yeah, speaking of killing animals, The Perfect Jennifer is quite upset about shooting the buffalo in the game as she is a vegetarian and extremely soft-hearted even about shooting virtual animals. She asked Dr. Longie, our Dakota cultural consultant, didn’t they have vegetarians back in the day.

He said, “Yes, the Dakota had people who didn’t eat meat. We called them ‘Bad Hunters’.”

### Nov

#### 5

# Survival analysis tip from Ogden Nash

November 5, 2011 | Leave a Comment

I might as well give you my opinion of these two kinds of sin as long as, in a way, against each other we are pitting them,

And that is, don’t bother your head about the sins of commission because however sinful, they must at least be fun or else you wouldn’t be committing them.

It is the sin of omission, the second kind of sin,

That lays eggs under your skin.

The way you really get painfully bitten

Is by the insurance you haven’t taken out and the checks you haven’t added up the stubs of and the appointments you haven’t kept and the bills you haven’t paid and the letters you haven’t written.

I had to respond to the post by Mike Nemecek and tweets by Rick Wicklin quoting Shakespeare with some culture of my own. Not having any degrees in liberal arts, though, the best I could do was this excerpt from the poem by Ogden Nash, Portrait of the Artist as a Prematurely Old Man.

Lately I’ve been playing around with the LIFETEST procedure and it occurred to me that a way to get painfully bitten with this, and other survival analysis procedures, is not to think about some obvious facts. I’m assuming you are new to these procedures, or else in a big hurry, and you don’t scrutinize your output carefully. In that case, you may misinterpret the mean survival time.

The mean survival time is the mean length of time people / bacteria / rats survived, right? Not necessarily.

Many procedures (say, factor analysis or regression) automatically drop observations with missing data. Survival analysis procedures don’t work exactly the same way.

I am telling you this because it is a mistake I have seen people make who were familiar with other statistical procedures, and I can only presume in a hurry. Their solution to not knowing the length of survival time for some of their subjects was to drop those for whom the data are unknown.

Let’s try this with a data set I have lying around. I only use those observations for which I have complete data, that is, I know the survival time. It gives me the following information on survival time in days.

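| Mean | Standard Error |
|---|---|
| 360.934 | 22.183 |

Quartile Estimates

| Percent | Point Estimate | Transform | 95% Confidence [Lower | 95% Confidence Upper) |
|---|---|---|---|---|
| 75 | 532.00 | LOGLOG | 489.00 | 612.00 |
| 50 | 308.00 | LOGLOG | 244.00 | 393.00 |
| 25 | 167.00 | LOGLOG | 117.00 | 193.00 |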

Easy, right? The mean survival time is about 361 days. The median is 308 days.

However, this is only using those people for whom we have a survival time. What about the other people? When I include EVERYONE, whether they died or not, I get the following

| Mean | Standard Error |
|---|---|
| 431.466 | 22.506 |

So, is this the correct value then? Are these the correct percentile points?

| Percent | Point Estimate | Transform | 95% Confidence [Lower | 95% Confidence Upper) |
|---|---|---|---|---|
| 75 | 652.00 | LOGLOG | 560.00 | 755.00 |
| 50 | 428.00 | LOGLOG | 341.00 | 512.00 |
| 25 | 192.00 | LOGLOG | 157.00 | 237.00 |

Well, not exactly. In fact, if you are using SAS, you will see this helpful note in your log

`The mean survival time and its standard error were underestimated because the largest observation was censored and the estimation was restricted to the largest event time. `

In my sample data set, 25% of the observations were censored, that is, we don’t know when they died.

Can we then say that the median survival time really is 428 days? Okay, the mean is not correct, because some of those 25% may have died years later. What about the median? The answer is, “It depends.”

Depends on what, you might ask. Well, it depends on why you don’t know when they died. If they dropped out of your study and you have no idea what happened to them, then I would say that you want to be a bit cautious in your interpretation of both the mean and the median survival times.

If everyone whose survival time you don’t know is censored because your study ended and they were still alive, then I would say, yes, the median survival time is accurate, assuming you had data on all of those people at the beginning and the end of the study.
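To see why dropping the censored observations pulls the median down, here is a bare-bones Kaplan-Meier sketch in Python. The eight survival times are invented for illustration and have nothing to do with my data set. The median rule here is a simplified version of what a procedure like LIFETEST does (no tie handling when survival is exactly 0.5, no confidence limits).

```python
def kaplan_meier(times, events):
    """Return [(time, estimated survival)] at each distinct event time.

    events[i] is True if subject i died at times[i],
    False if the subject was censored (still alive when last seen).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = censored = 0
        while i < len(data) and data[i][0] == t:  # everyone with the same time
            if data[i][1]:
                deaths += 1
            else:
                censored += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        # Censored subjects leave the risk set without counting as deaths.
        at_risk -= deaths + censored
    return curve

def median_survival(curve):
    """First event time at which estimated survival is 0.5 or lower."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # survival never dropped to 0.5 during follow-up

# Invented data: days survived, and whether the death was observed.
times = [5, 8, 12, 12, 20, 25, 30, 30]
events = [True, True, False, True, True, False, True, False]

everyone = median_survival(kaplan_meier(times, events))
deaths_only = median_survival(kaplan_meier(
    [t for t, e in zip(times, events) if e],
    [True] * sum(events),
))
print("median using everyone:", everyone)             # 20
print("median dropping censored cases:", deaths_only)  # 12
```

The censored subjects still count as being at risk up until the day they were last seen, which is exactly the information you throw away when you delete them.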

Although survival analysis is generally thought of as predicting whether one survived in the sense of not dying, it is also used for other applications, such as predicting how long people last in a treatment program, without committing another crime, without drinking or using drugs. In those cases, when people are lost by dropping out of your study and out of sight, I would suspect it is very possible that at least some of those people began drinking, committing crimes or whatever prior to the end of your study. So, when people are censored due to having missing data, I would be a bit skeptical of both the median and the mean, and, as with all data problems, the larger percentage of your sample this affects, the more worried about it I would be.

An interesting point came up the other day when I was listening to a lecture. I’d assumed in animal studies that you would only have the problem of right-censored data due to the subjects living past the end of the study period. She mentioned that a couple of her subjects were censored because the rats had died of other causes unrelated to the disease under study. Not sure what occupational hazards a rat faces, but it’s getting pretty bad when you can’t even count on a rat these days.

### Aug

#### 23

# Giving Students Their Money’s Worth Online

August 23, 2020 | 2 Comments

Lately, there’s been a lot of talk about making college, or younger, students feel as if they are really getting the same education when teaching online versus in the classroom.

As someone who has taught online since 1997 (yes, you read that right) and has taught the same classes both in the classroom and online, I have a few suggestions.

## Online Classes Can be Better than Face to Face

### Record Your Lectures

The very first suggestion I have is to record your lectures and make those downloadable. The university where I teach has Blackboard and this is an option. If your school does NOT have that option for whatever web meeting software you have and you have a Mac you can make a screen recording with QuickTime and upload it to a YouTube account.

### Share Data Libraries

I teach multivariate statistics and we use some methods that require at least a modest sample size. Having students type in hundreds of records is ridiculous. Instead, I can download and clean data from sites like ICPSR or the California Health Interview Survey.

I upload the codebooks to the class website.

I upload these files to a class directory using SAS Studio. I give my students the LIBNAME with read-only access and they have a data set with thousands or tens of thousands of records all set to analyze.

For assignments where data cleaning is part of it, I give them access to the original data.

#### Yes, you can get SAS for free

Students can get a SAS Studio account for free, run their programs, download and send me both their log and their output.

### Make Cheating Less Tempting

Friends who are new to teaching online say cheating is a real problem. I try to remove the temptation by making it harder. If I give you a dataset with 500 variables and ask you to pick 20, run a factor analysis, write up your results and send me your log and output I can at least see it was run under your account and it’s not going to be the same exact variables as someone else.

That doesn’t mean a student can’t have paid someone to do it for them or had a relative do it. [I was shocked to read on a forum all the women who said they did their husband’s masters degree homework “Because the degree will help our household income and he works all day.”]

This type of cheating isn’t something you can prevent in face-to-face classes either unless you have the student write all of their papers in front of you.

One way to make cheating less tempting is to have assignments that students can individualize. A change I made for the fall semester is to give two different data sets for each assignment. One is the Monitoring the Future study with survey data from youth, the other is the California Health Interview Survey.

I try to update these datasets fairly frequently, so I just replaced the 2009 CHIS with the 2018 data set.

So, if you are interested in social science or health analytics, you can pick whatever interests you. Sometimes the most hard core engineering majors pick the MTF study of youth because they have an adolescent at home and are curious about national norms, how adolescents rate their communication with parents, etc.

Still, I would like a third data set with something more marketing or engineering focused. If anyone has a suggestion, hit me up in the comments please.

### Have Online Discussion Boards and Don’t Make Them Stupid

These boards should not be just a waste of time. Again, related to the preventing cheating, I often ask questions related to their papers, like,

“What variables are you thinking of using for your factor analysis assignment? Do you see any possible problems with those variables?”

The second part of each question is to ask another question for the next student to answer.

I’m fortunate that I often have students who are in the same cohort so they know each other and will comment on something related to the other students’ work or interests.

### Get to Know Your Students

I taught middle school students this summer in a Game Design Course and it was a blast. (We’re doing it again this fall, if you have a middle schooler you’d like to sign up, click here to get info and put GAME DESIGN in the message).

Whether middle school or adults I ask them to turn on the camera and say hi the first day just so I know what they look like and their voice.

Just like an in-person class, I start by asking everyone where they are from, making sure I know how to pronounce their names correctly and ask them to tell me one interesting thing. For the middle school students it might be the name of their dog or that they play saxophone. For the adults it might be that they work at CDC or really want to do research on infant mortality in Nigeria, where they grew up.

### If you’re not a jerk, online classes can be better for your students

I have heard of instructors who insist all students have on their camera at all times, not on mute, be dressed appropriately, no distracting background. That’s just stupid. For my adult students, they may have small children running around, they may be making dinner. I don’t care. Why should I? If they miss something, they can replay the video later. This is one way online classes are BETTER for adult students.

For my middle school students, maybe they are embarrassed about their room, their looks. As someone who has taught middle school, I can tell you that there is almost nothing a middle school student can’t be embarrassed about. Maybe they are lounging on their bed while listening to me. So what? This is a way online classes might be better for younger students.

#### Also, don’t be a jerk about the chat.

I do read all the chat messages that go on while I’m talking. If it is a question to me, I answer it. Some students feel more comfortable typing/ texting than talking.

My adult students never veer too far off task. With the middle school ones, I might need to drop into the chat from time to time and say “Enough with the poop emojis”. Usually, though, their classmates do it for me.

Well, I have lots more ideas but it’s Saturday and I have to finish writing my assignments for next month.

## If you’ve been wondering why I haven’t been blogging for four months –

Well, there’s a pandemic, and demand for educational software has spiked, our 7 Generation Games company has upped both users and employees 50%, The Julia Group has more of a demand for online training, analysis and app development so, yeah, been busy.

### Jan

#### 19

# The first things a statistical consultant needs to know

January 19, 2020 | 3 Comments

I’ll be speaking about being a statistical consultant at SAS Global Forum in D.C. in March/ April. While I will be talking a little bit about factor analysis, repeated measures ANOVA and logistic regression, that is the end of my talk. The first things a statistical consultant should know don’t have much to do with statistics.

### A consultant has paying clients.

In History of Psychology (it was a required course, don’t judge me) one of my fellow students chose to give her presentation as a one-woman play, with herself as Sigmund Freud. “Dr. Freud” began his meeting with a patient discussing his fee. In fact, Freud did not accept charity patients. He charged everyone. There’s a winning trivial pursuit fact for you.*

Why am I starting with telling you this? Because I have had plenty of graduate students whose goal is “to be a consultant” but they seem to think their biggest problem when they start out is going to be whether they should do propensity score matching using the nearest neighbor or caliper method.

## Here are the biggest problems you’ll face:

- Getting your first clients
- Getting paid
- Getting your data into shape
- Communicating results to your clients.

Let’s start with getting clients. I can think of four ways to do this: referrals, as part of a consulting company, through your online presence and through an organization. I’ve done three of them. First, and most effective, I think, is through referrals. I got my first two clients when professors who did consulting on the side recommended me. I do this myself. If someone can’t afford my fees or I am just booked at the moment, I will refer potential clients to either students, former students or other professionals I know who are getting started as consultants. It’s not competing with my business. I am never going to work for $30 an hour again and if that’s all that’s in your budget, I understand. If all you need is someone to do a bunch of frequency distributions and a chi-square for you, you don’t need me, although I’m happy to do it as a part of a larger contract.

### Lesson number one: Don’t be a jerk.

Referrals mean I’m using my own reputation to help you get a job and so I’m going to refer students who are good statisticians and who I think will be respectful and honest with the client. Don’t underestimate the latter half of that statement.

### Lesson number two: It helps if you really love data analysis.

I’d be the first to say that I’m a much nicer person now than when I was in graduate school. Yes, it took me a while to learn lesson one, I am embarrassed to say. However, I really did love statistics and if any of my fellow students had trouble, I was the first person they asked and I was really happy to help. When those students later became superintendents of schools or principal investigators of grants, they thought of me and became some of my earliest clients. Some of my professors also became clients, although those were after I’d had several years of experience.

### Lesson number three: Don’t think you are smarter than your clients.

A young relative, who has a Ph.D. in math, asked me, “No offense but isn’t what you do relatively easy, like anyone who understood statistics could do it? Why are you so in demand?” *Corollary to this lesson: If you find yourself saying, “No offense” just stop talking right then.*

One reason a lot of want-to-be consultants go bankrupt or have to find another line of work is they do think they are smarter than their clients. This manifests itself in a lot of ways so we’ll return to it later, but one way is that they charge much more than the work is worth.

#### How do you know how much your work is worth?

### Lesson number four: Ask yourself, if I had twice as many grants/ contracts as I could do and I was paying someone to do this work, what would I be willing to pay?

That’s a good place to start.

I’ve met a lot of people over the years who charged much more than me and bragged to me about it. In the long run, though, I’m sure I made a lot more money. Clients talk. They find out that you are charging them three times as much as their friend down the block is getting charged by their consultant. You may think you’re getting away with it, but you won’t. You may get paid on those first few contracts but you’ll have a very hard time getting work in the future.

### Lesson number five: Know multiple languages, multiple packages

I’ve had discussions with colleagues on whether it is better to be a generalist or a specialist.

I have had a few jobs where they just needed propensity score matching or just a repeated measures ANOVA but those have been the small minority over the past 30 years.

I would argue that even those who consider themselves specialists actually have a wide range of skills. Maybe they are only an expert in SAS but that includes data manipulation, macros, IML and most statistical procedures.

In my case, I would not claim to be the world’s greatest authority on anything but if you need data entry forms created in JavaScript/HTML/CSS, a database back end with PHP and MySQL, your data read into SAS, cleaned and analyzed in a logistic regression, I can do it all from end to end. No, I’m not equally good at all of those. It’s been so long since I used Python, that I’d have to look everything up all over again.

I’ve used SPSS, Stata, JMP and Statistica, depending on what the client wanted. I think I might have even had a couple of clients using RapidMiner. For the last few years, though, the only packages I’ve used have been SAS and Excel. Why Excel? Because that’s what the clients were familiar with and wanted to use and it worked for their purposes. (See lesson three.)

I was really surprised to read Bob Muenchen saying SPSS surpassed R and SAS in scholarly publications. Almost no one I know uses SPSS any more, but, of course, my personal acquaintances are hardly a random sample. I suppose it depends on the field you are in.

## I have never used R.

Some people think this is a political statement about being a renegade. Others think it’s because I’m too old to learn new things or in subservience to corporate overlords or some other interesting explanation. (The Invisible Developer, who has been reading over my shoulder, says he never got past C, much less D through Q.)

Since I fairly often get asked why not, I will tell you the real reasons, which is a complete digression but this is my blog so there.

- In my spare time that I don’t have, I teach Multivariate Statistics at a university that uses SAS. Since I’m using SAS in my class anyway and need real life data for examples, when a client has licenses for multiple packages and doesn’t care what I use (almost always the case), I use SAS.
- About the time that R was taking off, my company was also taking off in a different direction. The Invisible Developer and I own the majority of 7 Generation Games which is an application of a lot of the research done by The Julia Group. When we started developing math games, we needed to learn Unity, C#, PHP, SQL, JavaScript, HTML/CSS. We also needed to analyze the data to assess test reliability, efficacy, etc. I called the analysis piece and told The Invisible Developer I was interested in all of it so I’d do whatever was left. He was really interested in 3D game programming so he did the Unity/C# part. I did everything else. Then, after a few years, I moved to Chile, where the language I most had to improve was my Spanish.

It worked out for me. We have a dozen games available from 7 Generation Games and now we’re coming out with a new line on decision-making.

I mention all this because I want to emphasize there isn’t a single path to succeeding as a consultant. There isn’t a specific language or package you have to learn. There is one thing you absolutely must have, though, and that’s the next post.

* See Warner, S. L. (1989). Sigmund Freud and Money. Journal of the American Academy of Psychoanalysis, Winter; 17(4): 609-22.

### Sep

#### 16

# 30 Things I Learned in 30 Years as a Statistical Consultant – Part 1 of lots

September 16, 2019

Never fear, I’m not going to post all 30 things in this post. This is a series. A LONG series. Get excited.

I was invited to speak at SAS Global Forum next year and it occurred to me after thinking about it for 14.2 seconds that there are plenty of people at SAS and elsewhere that are more likely to have new statistics named after them than me.

While I can code mixed models, path analysis and factor analysis without much trouble, I’d be the first to admit that there are plenty of new procedures and ideas I see every year that I never really master. I mean to, I really do, but then I get back to the office and attacked by work. So, the person to introduce you to every facet of the bleeding edge, nope, that’s probably not me, either.

#### If you think this is where I experience impostor syndrome and say “I couldn’t possibly have anything worth saying”, we have obviously never met.

Okay, there’s the most current picture of me, so now you sort of know who I am. I figured I better post a current one because I had not updated my LinkedIn photo in so long that I connected with someone who said,

“Oh, I have met your mom.”

And I had to reply,

“No, you have met me. My mom is 86 years old and retired to Florida, as federal law requires. Florida state motto: Your grandparents live here.”

### So, when do you get to these 30 things?

Now. I decided to divide everything I learned into four categories.

- Getting clients
- Getting data into shape
- Getting answers
- Getting people to understand you.

I picked four because if I had five or six categories, people would expect there to be an equal number of points in each because 30 divides evenly by five and six. See? I am good at math.

## The money part: Getting clients

### First, decide what kind of statistical consultant you want to be.

#### Are you a specialist or a generalist?

You can be like my friend, Kim Lebouton, who specializes in SAS administration for the automotive industry and seems intent on staying with the same clients until she or they die, whichever comes first. I linked to her Twitter because she is too cool to have a web page.

You could be like Jon Peltier of Peltier Tech and specialize in Excel. Basically, if there is anything Jon doesn’t know about Excel, it’s not worth knowing. Personally, I feel as if most things about Excel are not worth knowing, which is why I’m not that kind of consultant.

I do love that the Microsoft Store carries our games for Windows, though, so woohoo for Microsoft.

#### I’m the kind of statistician that doesn’t have a time zone.

A few years ago, I was at a conference when people were trying to coordinate their schedule for an online meeting. They were saying what time zone they were in and someone asked me,

“You’re on Pacific Time, right?”

My friend interrupted and said,

“She doesn’t have a time zone.”

It’s true. I was on Central Time last week, in North Dakota. I’m in California this week. Next week, I’m back on Central Time in Minnesota and South Dakota. The following week, I’m on Eastern Time in Boston.

In the winter here (which was summer there), I was in Chile. During the spring here (which was fall there), I was in Australia, and I’m in the U.S. now.

## BUT HOW DO YOU FIND CLIENTS?

This is probably the question I get the most and I have an odd answer.

#### Get really good at something and the clients will find you.

Jon’s really good with Excel. Kim is superb at SAS administration. What am I good at? I’d say I am excellent at taking something that a client may only be vaguely aware is a statistical problem and solving it from beginning to end, in a way that makes sense to them.

If you try mansplaining me in the comments that what I do is called applied statistics, I will find where you live and slap you upside the head. I teach at National University in the Department of Applied Engineering. It’s in the fucking department name. I KNOW.

In response to the question in stats.stackexchange regarding the difference between mathematical statistics and applied statistics, there was this answer:

Mathematical statistics is concerned about statistical problems, while applied statistics about using statistics for solving other problems.

– Random person I don’t know on the Internet

Mathematical statistics often involves simulated (that is, fake) data, and nearly always uses data that is cleaned of data entry errors – in other words, not very representative of real life.

If you ask me, and even if you don’t, many data scientists act as if data issues can be fixed by having big enough data. This always seems to me similar to those startups that are losing money on every sale but aren’t worried because they are going to make it up on volume. Since data is key, let’s talk about that in the next post.

### But wait! How do you get those first clients?

There is never a surplus of excellence – unless maybe you are an English professor, but they’re not reading this blog.

#### Network.

Let your professors know that you are interested in consulting. I got my first consulting contracts by referrals from professors who had more work than they could do. Similarly, I have referred several potential clients to students and junior professionals either because I was too busy, not interested or they could not afford my rates.

#### Go to conferences

I’ve had clients referred by other consultants who met me at a conference and a particular contract was not in their area of expertise but they thought it might be in mine. Similarly, I’ve referred clients to other people because I don’t really do that thing but maybe this person will be available.

#### Most jobs come by word of mouth

There is an evaluation consultant organization. I don’t know who the hell belongs to it. In much of the work that I do, someone’s job is on the line. That is, if they can’t demonstrate results, they may lose their funding and everyone in the building loses their job. In almost all of it, at some point the project director or manager or whoever is going to go present these results to a federal agency, tribal council or upper management, trusting that everything they say is true because I said so.

In that type of high-stakes situation, they’re not going to get someone from an ad on Craigslist. If that sounds like bad news, the good news is that after you have been around for a while and done good work, the jobs come to you.

Since a big difference between mathematical statisticians and applied statisticians is the messiness of the data, I’m going to address that in the next few posts. Expect more swearing. Because data.
