Macros, SQL and Reading CSV with SAS – Part 1
December 27, 2022
I wish I knew then what I know now
Close to 40 years ago, I was a bright, young industrial engineer who had been using SAS for a year or so. I thought it was the best thing since sliced bread and could solve all problems. An older engineer, who did not have much use for all of those new-fangled computer things, was told by his supervisor to talk to me about automating some type of factory loading. I don’t remember the details but I think it had something to do with figuring out how many people to have work each shift based on the hours estimated necessary to make each part and the number of each type of part needed.
I don’t remember the details but I remember that I couldn’t solve the problem
The types of parts and number of each changed daily. The part numbers were sometimes numeric and sometimes a combination of letters and numbers. There was not a defined length for part number and the part numbers and the number required of each were not in specific columns. Both SAS and I have come a long way since then and I am pretty sure I could solve a problem like that now.
I don’t do a lot with SAS these days but I had a problem that reminded me of that first stumbling block
A student had a question about analyzing a longitudinal dataset. Some work had been done on it by a graduate student who has graduated and moved away.
The problem involved the following:
- The previous student had used a macro to read in XLSX datasets but this year’s datasets were in CSV format
- In some datasets the unique identifier was 25 characters in length, in some it was 12, in others 20.
- The database had been tested quite a bit by the graduate student who left, and all of the test students needed to be removed. Fortunately, each had the word TEST somewhere in the username, except for some that had “UNDEFINED” or no username at all because the initial programmer had skipped over the login step to test some feature.
- The unique identifier went by different names in different datasets. In some it was called ‘student’, in others it was ‘username’ .
- Some of the datasets this year did not exist last year.
- They wanted the total number (non-duplicated count) of students for each year, for each school.
- Students duplicated across datasets were not really duplicate users. For example, School A might have a JuanGarcia and a JuanGarcia2 but School B could have a different JuanGarcia and it was a different person.
Solving parts 1, 2, 3 and 4
Fixing part 1 was super easy. To change the macro, I just switched from DBMS = XLSX to DBMS = CSV and it read in the CSV files. Easy peasy.
For part 2, I created a variable named student and gave it a length of 25, which was the longest length any of the (many!) individual datasets had for an actual user’s name.
I used the UPCASE function to make all the user names upper case and set the student variable to have that value. I used the INDEX function to find and delete any that had the word TEST or UNDEFINED or were blank.
%macro imp_csv(refd=,outd=,school=) ;
/* All I needed to do here was change DBMS = XLSX in the original macro to DBMS = CSV */
PROC IMPORT DATAFILE=&refd
DBMS=CSV
REPLACE
OUT=work.&outd;
GETNAMES=YES;
RUN;
/* Creating a permanent dataset here, so a LIBNAME statement for IN is needed before the macro is called */
DATA in.&outd;
SET work.&outd ;
LENGTH student $ 25 ;
student = UPCASE(username) ;
if INDEX(student,"TEST") > 0 or INDEX(student,"UNDEFINED") > 0 or student = "" THEN DELETE ;
school = &school;
RUN;
%mend ;
I used this macro for the half of the datasets that had a unique identifier of “username”. The other half all had a unique identifier of “student” and, in all cases, a length of 25 characters. I actually used a different macro to read in the other half because those had a whole different set of problems, but that is a post for another day.
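If you want to try the filtering step on its own before running it on real files, here is a minimal sketch. The usernames are made up for illustration; only the UPCASE and INDEX logic comes from the macro above.

```sas
/* Made-up usernames, just to exercise the filter from the macro */
data filter_demo;
   length username student $ 25;
   input username $25.;
   student = upcase(username);
   /* Drop test accounts, UNDEFINED accounts and blank usernames */
   if index(student, "TEST") > 0 or index(student, "UNDEFINED") > 0 or student = "" then delete;
   datalines;
JuanGarcia
test_user7
undefined
MariaLopez2
;
run;

proc print data=filter_demo;
run;
```

Only JuanGarcia and MariaLopez2 should survive. One caution: INDEX looks for the string anywhere in the value, so a real student whose username happens to contain TEST would be deleted too. It is worth printing the deleted records once to check.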
Calling the macro
FILENAME reffile1 '/home/annmaria/data_analysis_examples/schoolA.csv';
%imp_csv(refd=reffile1,outd=schoolA, school="Academy of A");
For each file, I simply needed to give the location of the CSV file and then, in the macro call, give that file reference, the name I wanted for the output dataset and the name of the school. Since I am creating permanent datasets in this macro, I need a LIBNAME statement for the IN library in the program before the macro is called.
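Put together, a run for two schools might look something like the sketch below. It assumes the imp_csv macro has already been submitted. The sasdata folder, the schoolB file and "Academy of B" are hypothetical names I made up for the example.

```sas
/* Hypothetical folder for the permanent datasets; the libref IN matches DATA in.&outd in the macro */
libname in '/home/annmaria/data_analysis_examples/sasdata';

filename reffile1 '/home/annmaria/data_analysis_examples/schoolA.csv';
%imp_csv(refd=reffile1, outd=schoolA, school="Academy of A");

/* Hypothetical second file, read the same way */
filename reffile2 '/home/annmaria/data_analysis_examples/schoolB.csv';
%imp_csv(refd=reffile2, outd=schoolB, school="Academy of B");
```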
Okay, that is Part 1. As you can probably guess, after I had done Part 1, Part 2 was super simple. Actually, Part 1 was pretty simple, too.
In my day job, I make educational games and teach courses on entrepreneurship at a tribal college. When I am on vacation (and I am currently on a five-day vacation, which is my longest in 49 years), I help students with their research with SAS code and write blogs about it.
I will try to get to Part 2 tomorrow when I get back from the aquarium with my grandchildren.
Converting to fiscal years, using SAS
January 30, 2022
Often when analyzing data for annual reports, you need to summarize by fiscal year but people normally enter calendar dates. Also, if you didn’t know, now you know …
SAS and Excel use different date formats
If you have uploaded your data from an Excel file, you want to convert from an Excel date to a SAS date by subtracting 21,916. This is because Excel numbers days starting with January 1, 1900 as day 1 and SAS numbers them starting with January 1, 1960 as day 0. Coincidentally, 365.25*60 = 21,915. The actual number of days between the two dates is one fewer, 21,914, because century years like 1900 do not have a February 29 unless they are divisible by 400 (so 2000 did have one). Excel, however, counts a February 29, 1900 anyway and starts at 1 rather than 0, which brings the difference to 21,916. It’s true. Look it up. You can also read this article on SAS and Excel dates by Erik Tilanus. Or you can just quit being difficult and subtract 21,916.
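If you would rather check the offset than take my word for it, a quick sketch like this does it. The serial number 44585 is what Excel shows for January 24, 2022 under its default 1900 date system.

```sas
data _null_;
   excel_serial = 44585;             /* Excel serial number for January 24, 2022 */
   sas_date = excel_serial - 21916;  /* shift to the SAS origin of January 1, 1960 */
   check = mdy(1, 24, 2022);         /* the same date built directly in SAS */
   put sas_date= date9. check= date9.;
run;
```

Both values in the log should show the same date. If your workbook uses the 1904 date system (an option in Excel, and the old Mac default), the offset is different and this subtraction will not work.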
Use four functions to find fiscal year
Annoyingly, my data were read in as character data, even though it was the number of days since January 1, 1900 which is definitely a number.
Function 1: Use input to convert data from character to numeric
The input function converts data from character to numeric (the put function goes the other way, from numeric to character). Simply include two parameters, the name of the variable you are reading in and the informat. Since I am creating a new variable, I may as well subtract the 21,916 from it at the same time.
Functions 2 and 3: Year and Month
If your fiscal year started in January, you wouldn’t be doing this, so clearly you need to get the month and year from your date. These are pretty obvious functions. Year gives you the year, and month gives you - wait for it - the month. Just give the name of the variable holding your date.
Function 4 : Use CATX to create a fiscal year variable
There is certainly more than one way to do this. If, for example, you wanted fiscal year 2020-21 to be referred to as 2020, you could just have something like:
if appmonth < 10 then fiscal_year = appyear - 1 ;
else fiscal_year = appyear ;
I can certainly see how you might want to do that if, for example, you were going to correlate year with some other variable and needed a numeric variable. In general, though, I find it a bit confusing when a date is January 24, 2022 and the year shows 2021.
The CATX function will concatenate any variables. The first parameter is what you want as a delimiter between the variables. You could use a space, but here I wanted a dash so that the result is something like 2021-2022.
For the current dataset, the fiscal year runs from October 1 to September 30 of the following year, so, if the month is less than 10, the beginning of the fiscal year was the prior year and the fiscal year ends in the year of the application date. If the month is 10 or greater, the beginning of the fiscal year is that year and the end of the fiscal year is the following year.
data test2 ;
set mydata.fixdata22 ;
app_date = input(application_date, 8.) - 21916 ;
appyear = year(app_date) ;
appmonth = month(app_date) ;
if appmonth < 10 then fiscal_year = catx("-", appyear - 1, appyear) ;
else fiscal_year = catx("-", appyear, appyear + 1);
run ;
So, that’s it. Now you have a new variable that allows you to subset and report on your data by fiscal year.
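Here is a small sketch for checking the logic on dates where you already know the answer, followed by a count by fiscal year. The two serial numbers are made-up test values (January 24, 2022 and October 1, 2021 in Excel), and fy_check is a name I picked for the example.

```sas
data fy_check;
   input application_date $;   /* character Excel serial numbers, as in my data */
   app_date = input(application_date, 8.) - 21916;
   appyear = year(app_date);
   appmonth = month(app_date);
   if appmonth < 10 then fiscal_year = catx("-", appyear - 1, appyear);
   else fiscal_year = catx("-", appyear, appyear + 1);
   format app_date date9.;
   datalines;
44585
44470
;
run;

/* Count of records in each fiscal year */
proc freq data=fy_check;
   tables fiscal_year;
run;
```

Both test dates fall between October 1, 2021 and September 30, 2022, so both should land in 2021-2022.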
I wrote this post because my first thought of how to do this was embarrassingly complicated. I honestly don’t know what I was thinking. It involved multiple arrays and a Do-loop and then I gave it 30 seconds more thought and realized the solution was actually super easy.
Giving Students Their Money’s Worth Online
August 23, 2020
Lately, there’s been a lot of talk about making college, or younger, students feel as if they are really getting the same education when teaching online versus in the classroom.
As someone who has taught online since 1997 (yes, you read that right) and has taught the same classes both in the classroom and online, I have a few suggestions.
Online Classes Can be Better than Face to Face
Record Your Lectures
The very first suggestion I have is to record your lectures and make those downloadable. The university where I teach has Blackboard and this is an option. If your school does NOT have that option for whatever web meeting software you have, and you have a Mac, you can make a screen recording with QuickTime and upload it to a YouTube account.
Share Data Libraries
I teach multivariate statistics and we use some methods that require at least a modest sample size. Having students type in hundreds of records is ridiculous. Even better, I can download and clean data from sites like ICPSR or the California Health Interview Survey.
I upload the codebooks to the class website.
I upload these files to a class directory using SAS Studio. I give my students the LIBNAME with read-only access and they have a data set with thousands or tens of thousands of records all set to analyze.
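For anyone setting this up, the statement students run might look something like this sketch. The path is hypothetical (yours will depend on where the shared class directory lives) and so is the libref name.

```sas
/* Hypothetical path to the shared class directory; ACCESS=READONLY blocks writes through this libref */
libname class '/home/instructor/classdata' access=readonly;

/* List the datasets in the library without printing every variable */
proc contents data=class._all_ nods;
run;
```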
For assignments where data cleaning is part of it, I give them access to the original data.
Yes, you can get SAS for free
Students can get a SAS Studio account for free, run their programs, download and send me both their log and their output.
Make Cheating Less Tempting
Friends who are new to teaching online say cheating is a real problem. I try to remove the temptation by making it harder. If I give you a dataset with 500 variables and ask you to pick 20, run a factor analysis, write up your results and send me your log and output I can at least see it was run under your account and it’s not going to be the same exact variables as someone else.
That doesn’t mean a student can’t have paid someone to do it for them or had a relative do it. [I was shocked to read on a forum all the women who said they did their husbands’ master’s degree homework “Because the degree will help our household income and he works all day.”]
This type of cheating isn’t something you can prevent in face-to-face classes either unless you have the student write all of their papers in front of you.
One way to make cheating less tempting is to have assignments that students can individualize. A change I made for the fall semester is to give two different data sets for each assignment. One is the Monitoring the Future study with survey data from youth, the other is the California Health Interview Survey.
I try to update these datasets fairly frequently, so I just replaced the 2009 CHIS with the 2018 data set.
So, if you are interested in social science or health analytics, you can pick whatever interests you. Sometimes the most hard core engineering majors pick the MTF study of youth because they have an adolescent at home and are curious about national norms, how adolescents rate their communication with parents, etc.
Still, I would like a third data set with something more marketing or engineering focused. If anyone has a suggestion, hit me up in the comments please.
Have Online Discussion Boards and Don’t Make Them Stupid
These boards should not be just a waste of time. Again, related to preventing cheating, I often ask questions related to their papers, like,
“What variables are you thinking of using for your factor analysis assignment? Do you see any possible problems with those variables?”
The second part of each question is to ask another question for the next student to answer.
I’m fortunate that I often have students who are in the same cohort so they know each other and will comment on something related to the other students’ work or interests.
Get to Know Your Students
I taught middle school students this summer in a Game Design Course and it was a blast. (We’re doing it again this fall, if you have a middle schooler you’d like to sign up, click here to get info and put GAME DESIGN in the message).
Whether they are middle school students or adults, I ask them to turn on the camera and say hi the first day just so I know what they look like and what their voice sounds like.
Just like an in-person class, I start by asking everyone where they are from, making sure I know how to pronounce their names correctly and ask them to tell me one interesting thing. For the middle school students it might be the name of their dog or that they play saxophone. For the adults it might be that they work at CDC or really want to do research on infant mortality in Nigeria, where they grew up.
If you’re not a jerk, online classes can be better for your students
I have heard of instructors who insist all students have on their camera at all times, not on mute, be dressed appropriately, no distracting background. That’s just stupid. For my adult students, they may have small children running around, they may be making dinner. I don’t care. Why should I? If they miss something, they can replay the video later. This is one way online classes are BETTER for adult students.

For my middle school students, maybe they are embarrassed about their room, their looks. As someone who has taught middle school, I can tell you that there is almost nothing a middle school student can’t be embarrassed about. Maybe they are lounging on their bed while listening to me. So what? This is a way online classes might be better for younger students.
Also, don’t be a jerk about the chat.
I do read all the chat messages that go on while I’m talking. If it is a question to me, I answer it. Some students feel more comfortable typing/ texting than talking.
My adult students never veer too far off task. With the middle school ones, I might need to drop into the chat from time to time and say “Enough with the poop emojis”. Usually, though, their classmates do it for me.
Well, I have lots more ideas but it’s Saturday and I have to finish writing my assignments for next month.
If you’ve been wondering why I haven’t been blogging for four months –
Well, there’s a pandemic, and demand for educational software has spiked, our 7 Generation Games company has upped both users and employees 50%, The Julia Group has more of a demand for online training, analysis and app development so, yeah, been busy.
Apr
9
Before the pandemic happened, I was planning on speaking at the SAS Global Forum on things I had learned as a statistical consultant. I wanted to call it “This is a hill I will die on” but one of my students suggested “This is a hill I will not die on” was a better title. However, by the time I had this idea the deadline for changing anything in your paper had already passed so the title was

From Santiago to the Spirit Lake Nation: 30 things I learned in 30 years as a statistical consultant
You can click the link above and read it.
My point is that I am a serious person doing serious things – some of the time and tomorrow I will write about statistics. However … since there is a blogging challenge going on …
Today, Eva and I decided to write about quarantine clothes
I am hardly the fashion plate at the best of times. In my bio for The Family Textbook, which is hilarifying and you can purchase for the measly sum of $2.99 it mentions my proclivity for collecting weird socks, which is true. It also notes that I have never sent a dick pic. Also true.

The first rule of web meetings is to wear clothes
The Invisible Developer, also the Chief Technology Officer of 7 Generation Games, contrary to popular belief, is very seldom bossed around by me. However, here is where I draw the line. When he proposed that he could be on time for a daily morning meeting – incidentally, at 11 am – if he attended in his bathrobe, I declared the meeting could start late and he would be clothed. We do, after all, have a sexual harassment policy around here and I am pretty sure showing up in video calls in your bathrobe under which you may or may not be wearing underwear violates it.
Rule #1 Does Not Apply if Your Camera is Turned Off
Gonzalo, a senior software developer, almost never appears with the web camera turned on and when he does, he was wearing a mask before it was cool. No, not like an N95 mask but like a “I’m-a-member-of-the-horde-from-World-of-Warcraft” mask.
If you think I am kidding, check out this video on designing video games which includes Gonzalo and his very cool mask.
When I mentioned the clothing required rule he said,
Wait, what? You can’t attend the meeting in your pajamas?
I told him that rule only applied if your camera was turned on, and then he calmed down considerably.
Rule #2: Only what can be seen on camera matters
Which is why, today, it was perfectly appropriate for me to attend three meetings wearing a plain, long-sleeved blue shirt, a hoodie, long underwear pants and sock monkey slippers.
Rule #3 All quarantine outfits can be improved by well-chosen socks
I have socks with flamingos, sushi, my granddaughter’s face, multi-colored chihuahuas and World War II female welders.

Rule #4: Some meetings are so stupid, they require special socks

I try to avoid useless meetings that should have been an email but sometimes these are unavoidable. In this case, it is extremely important to have the correct socks because you can look down and appear to be studiously considering whatever dumb ass suggestion the other person has just made.
Rule #5 For people who say you need to dress professionally for web meetings, see rule #4
The Blog Hour
April 8, 2020
My granddaughter was bored.
She had been home for three weeks, in Minnesota, which meant much of her time was spent indoors because it is cold outside and she lives in a city.

This week was even worse because it was spring break and she said,
Me and my friends used to think that if we had no school and we could just stay home all the time it would be great but really it’s HORRIBLE.
Making it even worse, she and her sister were supposed to be spending spring break in Santa Monica with us, chilling by the beach and meeting up with friends from her old school.

Recently, we’d created a WordPress site for her but it had nothing but the sample pages that came with it. She said she couldn’t think of anything to write. So, I said:
I challenge you to The Blog Hour!
Every day now, at 7 pm Pacific Time, we call each other and start blogging. There are no rules except that we need to start at (about) 7 pm and blog for no more than one hour. At 8 pm, promptly, we both stop.
You are welcome to join us
If you do, send me a link to your blog.
Eva’s first post was on Quarantine Ideas
Mine was Everything is NOT fine
Yesterday, she wrote on Quarantine Food
And I wrote about ideas to De-stress during a Pandemic
Something I have learned about blogging over the years …
There is no difference between the blogs you wrote because you felt inspired and those you wrote because of a challenge to write X number of words/ posts
I’ve been writing this blog for a dozen years, I did a judo blog pretty regularly for over a decade and I write posts on the 7 Generation Games blog, sometimes on life and sometimes on math.
When I look back over the years, I find it impossible to pick out the posts I did because there was some kind of public or personal challenge and those I wrote because I really felt strongly about what I had to say that day.

SO … if you are stuck in the house and need a challenge, Eva and I are throwing it down. Join us!
Check out the follow up post on fashion advice from me. Those of you who have met me in person are already rolling your eyes.

Being (less) stressed during a pandemic
April 7, 2020
Probably like many of you who read this blog, I feel this pandemic has lasted longer for me than for most people. Statistics is my thing. I teach it, I make games about it, I code statistical analyses and I provide statistical consulting.
A few weeks ago, there were 1.9 cases of Coronavirus per million people in the United States. I remember looking at the growth curves in the U.S. and around the world, thinking to myself,
“Oh, no, this is not going to be good.”
We’re now about 3,000 times the rate of infection we were then. It’s no wonder we’re all stressed.
Checking death statistics 10 times a day isn’t good for you
Initially, I checked the Worldometer site several times a day, thinking it could not possibly be as bad as I thought. No one else seemed to be that worried.
When everything started shutting down and more people were seriously concerned, I still spent my first hour every morning browsing the news on the virus. It was all bad and I found it hard to concentrate on work. Little things annoyed me.
I was already staying inside, not seeing my friends and family, working from home. Did me knowing exactly how much the death rate had climbed since yesterday do any good?
No, of course it didn’t. That was a rhetorical question.
What you should do instead
Start the morning with something you want to do.
For some people it might be a jog or a bike ride. Good for you. I did enough training when I was young to last until I’m 200. (I’m serious. Google it.)
Mine may sound really dorky but on my list for a long time has been wanting to get better at WordPress. I write this blog and one on the 7 Generation Games site. I wrote a blog on mostly judo and life for a dozen years, though I rarely update that any more.
I took some courses on lynda.com for a month and then I got busy for 8 months and did nothing. So, now I am back at it.

Every morning, I lie in bed, drink a cup of coffee and watch videos or read a book on WordPress
Whatever you’ve been WANTING to do, do that thing
Notice I said “wanting”, not “felt you should do”. No one looks forward to the next morning when they are going to clean out the junk drawer in the kitchen or do their taxes.
Three of the things I like most are coffee, sleeping late and programming. So, now, every morning, that is how I start my day.
Even better, my husband usually gets up, grinds the coffee beans and brings me up a cup so I don’t even have to get out of my warm bed.

Tell the people who think you should start your day with the things you have to do that they should go eat a frog
You’re at home. You’re going to be home ALL fucking day! You can start off by playing a video game for an hour.
Get a library card
Seriously, libraries are amazing. Before you start whining that the libraries are closed, know this …
Many libraries allow you to apply for a card online during the current pandemic
I have a card for the Los Angeles Public Library, the Santa Monica Public Library and, as a faculty member, I also have access to the National University library.
Through the Los Angeles library, I can download 15 ebooks a month using the Hoopla app. I can also download ebooks owned by the library and read these on a Kindle or iBooks app.
There is an app called Kanopy through which I can get six movies a month free.
I really like documentaries, so here is a place I’ve found a lot of interesting ones.
The Santa Monica Public Library only allows 6 downloads a month with Hoopla, which is why I needed two library cards!
There is just a lot of cool stuff, from apps to learn languages to checking out newspapers. Don’t want to subscribe just to read that one article? Use your library card.
Okay, so there are my two recommendations for today:
- Start your day with something you want to do.
- Check out the free books, movies, magazines, newspapers and apps from your public library.
You can also play AzTech: The Story Begins, on your iPhone or iPad while you are waiting to hear my next post recommendation. It’s an interesting idea. You’ll see.
Everything is NOT just fine
April 3, 2020
I’ve read a lot of cheery tweets that said something like,

“Buffy, Biff and I are isolated at home with our terrier, Boo. Here’s a picture. Isn’t he cute? We played card games, then I baked this three-course meal I saw on Pinterest. Biff is taking this time to finally become proficient in Mandarin with a course he is taking online.”
Seriously, what is WRONG with you people?
Now, those are the people we all want to slap, but there is another group that is more worrying. If working remotely is your usual mode, you are still drawing a paycheck and no one in your family is seriously ill, you may feel as if you should be going about life as usual.
I was in that group. After all, I have an office in my house where I usually work when I’m not traveling. My husband works upstairs. I’ve taught online for years. So, I’m in the same place, doing the same thing. Other people have real problems. Everything is fine.
Everything is NOT fine
A very sensible tweet I read said something like,
If you haven’t eliminated at least one student assignment, you are doing it wrong. Students are having to do their classes on line, have lost jobs, have jobs for which demand has skyrocketed overnight, have children or siblings at home interrupting them, have to share a computer, don’t have Internet access. They can’t go to the beach or the gym to de-stress. Some are home with abusive parents or partners. Expecting the same level of work is clueless.
I thought, “Well, yeah, I am sure that is true for students who are living in poverty, who are in elementary or middle school, but I teach graduate students who are professionals.”
Then … I got the assignments that were due after everything began locking down. Now, I should preface this by saying I have taught the same course for the same university for seven years. Over the past couple of years, the admission requirements for the program have been tightened, so the average student is more prepared.
My highly qualified graduate students made mistakes that I know they would not normally make
How do I know this? Before Coronavirus was an everyday word, their work was as good as or better than the average class. As the country began to shut down, they began to make mistakes at a far higher rate than my previous classes. These were particularly common on problems that required detailed attention. For example, looking at the data to see that the subject numbers were all duplicated and then identifying this as a problem that requires repeated measures analysis.
I made mistakes that I would normally never make
One thing I am usually scrupulous about is data quality and data integrity. In fact, it was a major part of the paper I was supposed to give at SAS Global Forum – which was cancelled. The whole conference was cancelled, that is, not just my paper. Yet, I uploaded the wrong data set to the course directory, didn’t do any descriptive statistics and barely glanced at the PROC CONTENTS. Of course I know better!
The first step in solving a problem is admitting you have a problem
If you’ve read this blog for a long time you may know that I’m not a particular fan of poetry. However, I do know there was a poem with the title, “No man is an island.” (See, not as completely uncivilized as you thought!)
Even if you are healthy, have a safe place to live and a paycheck, you probably know people who don’t
Even if everyone you know- lucky you – is healthy, wealthy and wise, there is the probability that any one of you can get hit tomorrow. Your dad, grandmother or child can become sick. Someone in your family or a close friend can lose a job.
Your daily routine has been disrupted
You can’t go to the gym, church, the library, the mall. Maybe, like many of my friends, your judo club or church is where you used to spend many hours every week and now you can’t go there. People who were important in your life you can’t see any more. Maybe you can’t see your family and friends because they are at high risk due to health problems and have to self-isolate.
Yes, you aren’t living in a slum with no running water, so maybe you feel as if you should be “just shaking it off” and finding some “quarantine project” like Biff and Buffy.
Let me tell you this, Biff and Buffy are assholes. It’s perfectly normal to be anxious. The DEFINITION of anxiety is
A feeling of worry, nervousness, or unease, typically about an imminent event or something with an uncertain outcome.
Oxford Dictionary
We are definitely living in uncertain times.
So, now that we have admitted that it’s normal to feel anxious, the next post is some tips on what to do about it, without sounding TOO much like Buffy.
5 Basics of Consulting Success: Part 1
February 23, 2020
Last week, I mentioned that successful consultants have five categories of skills: communication, testing, statistics, programming and generalist.
COMMUNICATION
Communication is the number one most important skill. All five are necessary to some extent, but a terrific communicator with mediocre statistical analysis skills will get more business than a stellar statistician that can’t communicate. Communication is a lot more than explaining results to clients or making small talk at meet ups.
Documentation
Communication includes documentation, both in your code and internal documents such as codebooks or an internal wiki. It includes letting clients know what you’re going to do, what it’s going to cost, what that cost includes, what were your results and what those results mean. If you’re good at communicating with clients, colleagues and your future self, you’re half-way to success.
An example of the critical nature of communication can be found in the following retraction:
The identified programming error was in a file used for preparation of the analytic data sets for statistical analysis and occurred while the variable referring to the study “arm” (ie, group) assignment was recoded. The purpose of the recoding was to change the randomization assignment variable format of “1, 2” to a binary format of “0, 1.” However, the assignment was made incorrectly and resulted in a reversed coding of the study groups.
Aboumatar and Wise (2019, p. 1417)
Because of this incorrect coding, the reported results were the exact opposite of what actually occurred.
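To make that concrete, here is a sketch of what a documented recode with a built-in check could look like in SAS. This is not the code from the retracted study. The dataset name, the variable names and the direction of the coding (1 = intervention, 2 = control) are all assumptions made up for illustration.

```sas
data analytic;
   set trial;   /* hypothetical dataset with a variable ARM coded 1 or 2 */
   /* ASSUMED raw coding:  arm = 1 is intervention, arm = 2 is control */
   /* Analysis coding:     treated = 1 is intervention, treated = 0 is control */
   if arm = 1 then treated = 1;
   else if arm = 2 then treated = 0;
   else treated = .;   /* anything unexpected stays missing instead of being guessed */
run;

/* Cross-tab of old against new coding: every arm = 1 should sit under treated = 1 */
proc freq data=analytic;
   tables arm*treated / missing norow nocol nopercent;
run;
```

The comments say what each value means before and after, and the cross-tab gives anyone who opens the log a thirty-second way to confirm the recode went the right direction.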
Document coding!
Here is an example from a current research project where the CES-D depression scale was used, which requires several items to be reverse-coded before scoring.
In the HTML file where the user enters data that’s written to the database there is this comment:
<h5>I felt that I was just as good as other kids.</h5>
<!-- This is reverse-coded. Don’t you dare change it. -->
<div class="row mb-3">
<button id="cesd4_1" data-src="3" class="cesd4 btn btn-light shadow-box col-5 my-3 mx-auto">Not at all</button>
In the original file to read in the data to SAS, there is a comment:
*** NOTE: CESD IS ALREADY REVERSE-CODED. DOES NOT NEED CODING!;
FILENAME REFFILE2 '/home/directory3/data_analysis_examples/crossroads/cesd.xlsx';
In the internal wiki, there is this note:
Tables in Acme Project Database
CESD – Center for Epidemiologic Studies Depression Scale – NOTE: The data are reverse coded at data entry. There is no need to reverse code these. There are 25 columns in this table; ID, username, session number, questions 1 through 20 of the CESD scale, the CESD total which is the sum of the 20 questions, named item21 for some odd reason, and a time stamp.
Document everything! Document how items are coded and how subscales or totals are computed.
This may seem like overkill, but how many retractions could be prevented by this level of documentation? If you are a consultant, it’s probable that at some point someone else will be looking at these data, or that you may be called back a year later to do a longitudinal analysis. Your colleagues and future you will thank you. A year or two from now, I don’t want to be looking at this data set and wondering if I need to reverse-code those items or if it was already done. I want to KNOW!
I deeply suspect that there are more erroneous results published due to incorrect coding of data than to incorrect analyses. After all, the peer reviewers, editors and readers see how you analyzed your data. No one sees how you coded it but you and, possibly, the person who has your position after you.
The one skill a statistical consultant must have
February 17, 2020 | Leave a Comment
A few weeks ago, I ended my post by saying there is one thing a statistical consultant absolutely must have, and I promised to say what that is in the next post. Maria and I had just picked up our rental car at the Detroit airport when she turned to me and asked:
So, what is the one thing a statistical consultant has to have?
I told her,
“I have absolutely no idea what I was thinking last month!”
In my defense, I have been in five states and 22 cities in the past 21 days. Maria says it is only 16 because I was in Minneapolis, Fargo and Denver twice each. She also says I can’t count Denver, Chicago or San Francisco since I only changed planes there. Poo!

Now that I am back in Los Angeles and my brain has unfrozen, I think there are actually five things you must have, but one of these is the most important. In my not at all humble opinion, though, you need ALL FIVE.
Actually, the five skills a statistical consultant must have

- COMMUNICATION – This is the number one most important skill. If you don’t have the rest, you’ll still suck and be unemployed, but a terrific communicator with mediocre statistical analysis skills will get more business. I don’t just mean shaking hands and small talk at conferences, either. Communication includes documentation, both in your code and in codebooks, an internal wiki, etc. It includes letting clients know what you’re going to do, what it’s going to cost, what that cost includes, what your results were and what those results mean. If you’re good at communicating with clients, colleagues and your future self, you’re halfway to success.
- TESTING – I’ve ranted on this blog a lot about testing because it is one of the areas where people often seem to fall short early in their careers. I got a lot of hate for this post when I said I don’t hire self-taught developers because there are things they don’t teach themselves adequately, like testing.
- Statistics – Well, duh. Props to the person in the Chronicle of Higher Education forum whose signature read, “Being able to find SPSS in the start menu does not qualify you to run a multinomial logistic regression.” Your clients may not know what power, quasi-complete separation or multicollinearity mean in interpreting an analysis. They do trust that you understand whatever is necessary to be understood for the work. Don’t let them down.
- Programming – When I was a graduate student, Very Important Professors had lowly peon graduate students and programmers to write their code for them. All of those people had started their careers using punched cards (honest, it was that long ago). All of the statistical consultants I know do programming, or can code their own analyses if necessary. Even if you aren’t doing it all yourself – I’m certainly not these days – you need to know enough to review the code your minions wrote or help said minions when they get stuck. Sometimes, it’s just quicker to do it yourself than to explain it to someone else, especially if you need to fix a bug in code that a client is waiting on.
- Be a generalist – I’ll have more to say about this in future posts. In brief, even the consultants I know who are well-known specialists in one language know and use others. If you think your career is going to be you sitting on a mountain or in a penthouse office, pontificating to others about sums of squares, the computation of Wilks’ lambda or options for PROC GLMSELECT, you are going to be sadly disappointed. On the other hand, if you do know of a job like that, I would consider taking it for a sufficiently large quantity of money.
The first things a statistical consultant needs to know
January 19, 2020 | 3 Comments
I’ll be speaking about being a statistical consultant at SAS Global Forum in D.C. in March/April. While I will be talking a little bit about factor analysis, repeated measures ANOVA and logistic regression, that comes at the end of my talk. The first things a statistical consultant should know don’t have much to do with statistics.
A consultant has paying clients.
In History of Psychology (it was a required course, don’t judge me), one of my fellow students chose to give her presentation as a one-woman play, with herself as Sigmund Freud. “Dr. Freud” began his meeting with a patient by discussing his fee. In fact, Freud did not accept charity patients. He charged everyone. There’s a winning Trivial Pursuit fact for you.*
Why am I starting with telling you this? Because I have had plenty of graduate students whose goal is “to be a consultant” but they seem to think their biggest problem when they start out is going to be whether they should do propensity score matching using the nearest neighbor or caliper method.
Here are the biggest problems you’ll face:
- Getting your first clients
- Getting paid
- Getting your data into shape
- Communicating results to your clients
Let’s start with getting clients. I can think of four ways to do this: referrals, as part of a consulting company, through your online presence and through an organization. I’ve done three of them. First, and most effective, I think, is through referrals. I got my first two clients when professors who did consulting on the side recommended me. I do this myself. If someone can’t afford my fees or I am just booked at the moment, I will refer potential clients to students, former students or other professionals I know who are getting started as consultants. It’s not competing with my business. I am never going to work for $30 an hour again and if that’s all that’s in your budget, I understand. If all you need is someone to do a bunch of frequency distributions and a chi-square for you, you don’t need me, although I’m happy to do it as a part of a larger contract.
Lesson number one: Don’t be a jerk.
Referrals mean I’m using my own reputation to help you get a job and so I’m going to refer students who are good statisticians and who I think will be respectful and honest with the client. Don’t underestimate the latter half of that statement.

Lesson number two: It helps if you really love data analysis.
I’d be the first to say that I’m a much nicer person now than when I was in graduate school. Yes, it took me a while to learn lesson one, I am embarrassed to say. However, I really did love statistics and if any of my fellow students had trouble, I was the first person they asked and I was really happy to help. When those students later became superintendents of schools or principal investigators of grants, they thought of me and became some of my earliest clients. Some of my professors also became clients, although those were after I’d had several years of experience.
Lesson number three: Don’t think you are smarter than your clients.
A young relative, who has a Ph.D. in math, asked me, “No offense, but isn’t what you do relatively easy, like anyone who understood statistics could do it? Why are you so in demand?”
Corollary to this lesson: If you find yourself saying, “No offense” just stop talking right then.
One reason a lot of want-to-be consultants go bankrupt or have to find another line of work is they do think they are smarter than their clients. This manifests itself in a lot of ways so we’ll return to it later, but one way is that they charge much more than the work is worth.
How do you know how much your work is worth?
Lesson number four: Ask yourself, if I had twice as many grants/contracts as I could do and I was paying someone to do this work, what would I be willing to pay?
That’s a good place to start.
I’ve met a lot of people over the years who charged much more than me and bragged to me about it. In the long run, though, I’m sure I made a lot more money. Clients talk. They find out that you are charging them three times as much as their friend down the block is getting charged by their consultant. You may think you’re getting away with it, but you won’t. You may get paid on those first few contracts but you’ll have a very hard time getting work in the future.
Lesson number five: Know multiple languages, multiple packages
I’ve had discussions with colleagues on whether it is better to be a generalist or a specialist.
I have had a few jobs where they just needed propensity score matching or just a repeated measures ANOVA but those have been the small minority over the past 30 years.
I would argue that even those who consider themselves specialists actually have a wide range of skills. Maybe they are only an expert in SAS but that includes data manipulation, macros, IML and most statistical procedures.
In my case, I would not claim to be the world’s greatest authority on anything, but if you need data entry forms created in JavaScript/HTML/CSS, a database back end with PHP and MySQL, and your data read into SAS, cleaned and analyzed in a logistic regression, I can do it all from end to end. No, I’m not equally good at all of those. It’s been so long since I used Python that I’d have to look everything up all over again.
I’ve used SPSS, STATA, JMP and Statistica, depending on what the client wanted. I think I might have even had a couple of clients using RapidMiner. For the last few years, though, the only packages I’ve used have been SAS and Excel. Why Excel? Because that’s what the clients were familiar with and wanted to use and it worked for their purposes. (See lesson three.)
I was really surprised to read Bob Muenchen saying SPSS surpassed R and SAS in scholarly publications. Almost no one I know uses SPSS any more, but, of course, my personal acquaintances are hardly a random sample. I suppose it depends on the field you are in.
I have never used R.
Some people think this is a political statement about being a renegade. Others think it’s because I’m too old to learn new things or in subservience to corporate overlords or some other interesting explanation. (The Invisible Developer, who has been reading over my shoulder, says he never got past C, much less D through Q.)
Since I fairly often get asked why not, I will tell you the real reasons, which is a complete digression but this is my blog so there.
- In my spare time that I don’t have, I teach Multivariate Statistics at a university that uses SAS. Since I’m using SAS in my class anyway and need real life data for examples, when a client has licenses for multiple packages and doesn’t care what I use (almost always the case), I use SAS.
- About the time that R was taking off, my company was also taking off in a different direction. The Invisible Developer and I own the majority of 7 Generation Games which is an application of a lot of the research done by The Julia Group. When we started developing math games, we needed to learn Unity, C#, PHP, SQL, JavaScript, HTML/CSS. We also needed to analyze the data to assess test reliability, efficacy, etc. I called the analysis piece and told The Invisible Developer I was interested in all of it so I’d do whatever was left. He was really interested in 3D game programming so he did the Unity/C# part. I did everything else. Then, after a few years, I moved to Chile, where the language I most had to improve was my Spanish.

It worked out for me. We have a dozen games available from 7 Generation Games and now we’re coming out with a new line on decision-making.
I mention all this because I want to emphasize there isn’t a single path to succeeding as a consultant. There isn’t a specific language or package you have to learn. There is one thing you absolutely must have, though, and that’s the next post.
* See Warner, S. L. (1989). Sigmund Freud and money. Journal of the American Academy of Psychoanalysis, 17(4), 609-622.