Never fear, I’m not going to post all 30 things in this post. This is a series. A LONG series. Get excited.

I was invited to speak at SAS Global Forum next year and it occurred to me, after thinking about it for 14.2 seconds, that there are plenty of people at SAS and elsewhere who are more likely to have new statistics named after them than me.

While I can code mixed models, path analysis and factor analysis without much trouble, I’d be the first to admit that there are plenty of new procedures and ideas I see every year that I never really master. I mean to, I really do, but then I get back to the office and get attacked by work. So, the person to introduce you to every facet of the bleeding edge? Nope, that’s probably not me, either.

If you think this is where I experience impostor syndrome and say “I couldn’t possibly have anything worth saying”, we have obviously never met.

I’m the old person on the left. The youngest of many daughters is on the right.

Okay, there’s the most current picture of me, so now you sort of know who I am. I figured I better post a current one because I had not updated my LinkedIn photo in so long that I connected with someone who said,

“Oh, I have met your mom.”

And I had to reply,

“No, you have met me. My mom is 86 years old and retired to Florida, as federal law requires. Florida state motto: Your grandparents live here.”

So, when do you get to these 30 things?

Now. I decided to divide everything I learned into four categories.

  1. Getting clients
  2. Getting data into shape
  3. Getting answers
  4. Getting people to understand you.

I picked four because if I had five or six categories, people would expect there to be an even number of points in each because 30 divides evenly by five and six. See? I am good at math.

The money part: Getting clients

First, decide what kind of statistical consultant that you want to be.

Are you a specialist or a generalist?

You can be like my friend, Kim Lebouton, who specializes in SAS administration for the automotive industry and seems intent on keeping the same clients until she or they die, whichever comes first. I linked to her Twitter because she is too cool to have a web page.

You could be like Jon Peltier of Peltier Tech and specialize in Excel. Basically, if there is anything Jon doesn’t know about Excel, it’s not worth knowing. Personally, I feel as if most things about Excel are not worth knowing, which is why I’m not that kind of consultant.

I do love that the Microsoft Store carries our games for Windows, though, so woohoo for Microsoft.

Canoe the rapids and learn fractions, with your kids or by yourself because maturity is overrated

I’m the kind of statistician who doesn’t have a time zone.

A few years ago, I was at a conference when people were trying to coordinate their schedule for an online meeting. They were saying what time zone they were in and someone asked me,

“You’re on Pacific Time, right?”

My friend interrupted and said,

“She doesn’t have a time zone.”

It’s true. I was on Central Time last week, in North Dakota. I’m in California this week. Next week, I’m back on Central Time in Minnesota and South Dakota. The following week, I’m on Eastern Time in Boston.

In the winter here (which was summer there), I was in Chile. During the spring here (which was fall there), I was in Australia, and I’m in the U.S. now.

BUT HOW DO YOU FIND CLIENTS?

This is probably the question I get the most and I have an odd answer.

Get really good at something and the clients will find you.

Jon’s really good with Excel. Kim is superb at SAS administration. What am I good at? I’d say I am excellent at taking something that a client may only be vaguely aware is a statistical problem and solving it from beginning to end, in a way that makes sense to them.

If you try mansplaining me in the comments that what I do is called applied statistics, I will find where you live and slap you upside the head. I teach at National University in the Department of Applied Engineering. It’s in the fucking department name. I KNOW.

In response to the question in stats.stackexchange regarding the difference between mathematical statistics and applied statistics, there was this answer:

Mathematical statistics is concerned about statistical problems, while applied statistics about using statistics for solving other problems.

– Random person I don’t know on the Internet

Mathematical statistics often involves simulated (that is, fake) data, and nearly always uses data that is cleaned of data entry errors – in other words, not very representative of real life.

If you ask me, and even if you don’t, many data scientists act as if data issues can be fixed by having big enough data. This always seems to me similar to those startups that are losing money on every sale but aren’t worried because they are going to make it up on volume. Since data is key, let’s talk about that in the next post.

But wait! How do you get those first clients?

There is never a surplus of excellence – unless maybe you are an English professor, but they’re not reading this blog.

Network.

Let your professors know that you are interested in consulting. I got my first consulting contracts by referrals from professors who had more work than they could do. Similarly, I have referred several potential clients to students and junior professionals either because I was too busy, not interested or they could not afford my rates.

Go to conferences

I’ve had clients referred by other consultants who met me at a conference; a particular contract was not in their area of expertise, but they thought it might be in mine. Similarly, I’ve referred clients to other people because I don’t really do that thing, but maybe this person will be available.

Most jobs come by word of mouth

There is an evaluation consultant organization. I don’t know who the hell belongs to it. In much of the work that I do, someone’s job is on the line. That is, if they can’t demonstrate results, they may lose their funding and everyone in the building loses their job. In almost all of it, at some point the project director or manager or whoever is going to present these results to a federal agency, tribal council or upper management, trusting that everything they say is true because I said so.

In that type of high-stakes situation, they’re not going to get someone from an ad on Craigslist. If that sounds like bad news, the good news is that after you have been around for a while and done good work, the jobs come to you.

Since a big difference between mathematical statisticians and applied statisticians is the messiness of the data, I’m going to address that in the next few posts. Expect more swearing. Because data.

I had more tips on becoming a better programmer than the two I gave in the last post, but I had run out of margarita. Now, replenished with tequila and fresh lime by The Invisible Developer, here are two more. He often tells me that I should refer to myself as a developer and not a programmer because the latter is beneath me. I have never pretended to be cool. I started with punched cards as a programmer and a programmer I will remain. At least until the second margarita.

It’s Friday!

If you aren’t familiar with github, you could have gone to Chris Hemedinger’s super demo at SAS Global Forum. We use github for version control and it is indispensable for that. When several people are working on the same program, each of us can edit files and upload and download the latest versions without copying over each other’s code. If you are on a project with more than one developer, once you have used a git repository, you’ll fight anyone who tries to take it away.

Because it is so good for sharing, github is used a lot for open source projects and for people just making their code publicly available.

The main thing I learned that I didn’t know is that there is a SAS software organization on github: https://github.com/sassoftware

I had just assumed since SAS is a private company and definitely not open source that there would not be much available. I was wrong.

Whatever language you use, there is probably a github for it.

Here is a funny thing. When I first started learning JavaScript, I scavenged github to find examples of people making simple games like tic-tac-toe , Memory or mazes. I’d modify the code to do what I wanted and I thought all of these people were so much smarter than me.

After I learned a bit more, sometimes I saw functions or libraries in the code that didn’t do anything and I realized that a lot of these people had done the exact same thing as me – copied someone else’s code and modified it for their purposes.

Start by copying code from github, but don’t stop there

If you ask me – and even if you don’t, I’m going to tell you anyway – it is absolutely fine to download code from someone else’s repository on github and tweak it a little for your own purposes. However, don’t stop there! Dive into it. Figure out what each function does, try to understand their logic.

A better person than me would have their own public git repository. Oh well, I have a bucket of private ones for work and I’ve been writing this blog for 11 years, so that will have to do. YOU should definitely have a public repository, though. Changing the subject here …

Git Repositories that are NOT python, R or Viya

The top repositories almost all entail either integrating SAS and Python (not surprising because it is open source) or Viya or Visual Analytics (presumably because it is expensive and SAS wants to promote it). There are also a smattering of SAS-and-R repositories in the top hits and repositories for SAS and iOS and SAS and Android. I’m not interested in any of that at the moment.

Right now, I am super-swamped but I should have some free time over the summer, so here are my personal interests I am marking for later. With 116 repositories, any SAS aficionado should find something of interest, and remember, this is just the sassoftware repository. There are additional repositories of individual users, like the last one I noted below

SAS Studio Tasks is an area I’d like to learn more about, as in writing your own custom tasks.

Data mining is an area I am ALWAYS wanting to brush up on. This library of flow diagrams for specific data mining topics looks really cool.

Not a SAS Institute repository, this one from Michael Friendly is on macros and looks super cool.

As I mentioned above, I started using github for JavaScript code and there are TONS of repositories for just about any language that would tickle your fancy (what exactly IS a fancy, anyway?)

I have more tips but it will have to wait for another margarita and since my grandchildren are spending the weekend and just invaded my office, that will have to wait.

I did a random sample of presentations at SAS Global Forum today, if random is defined as “of interest to me,” which, let’s be honest, is pretty damn random most of the time.

Tip #1 Stalk Interesting People

I don’t mean in a creepy, showing-up-at-their-hotel-room way. If you see someone presenting, either in person or referenced on Twitter, blogs, etc., check out what else that person has freely available on the web, in published proceedings, etc.

Let me give you an example that applies even if you are not into logistic regression. (You’re not? Feel shame.)

The first session I attended was a Super Demo in the exhibit hall which for some reason I don’t understand is always called the Quad. 

In a nutshell, logistic regression is usually 

  • binary, which is where I started out, modeling mortality studies: you’re either dead or alive
  • multinomial, that is, multiple categories, like college major or religion, or
  • ordinal, like someone being a subscriber, contributor, editor or administrator on a group blog, which are progressively higher levels of involvement

What if the data fit the proportional odds model for some of the explanatory variables and not others? You can do a partial proportional odds model. 

Line plots on slide
Graphing your data is a great way to see if the proportional odds model makes sense. You can see that it does for the variable on the right but, for the one on the left, not so much.
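The super demo slides didn’t include code, but in PROC LOGISTIC a partial proportional odds model can be requested with the UNEQUALSLOPES option on the MODEL statement (SAS/STAT 12.1 and later). A minimal sketch, with hypothetical data set and variable names:

```sas
/* Partial proportional odds sketch (hypothetical names).
   The slope for engagement is allowed to differ across the
   ordinal response levels; the slope for age is held equal. */
proc logistic data = blogroles ;
   model role = age engagement / unequalslopes = (engagement) ;
run ;
```

Leaving the effect list off (just UNEQUALSLOPES) relaxes the assumption for every explanatory variable, which is the fully non-proportional model.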

Unfortunately, the super demos do not have a paper published in the app or proceedings; however, the presenter, Bob Derr from SAS, mentioned he had presented a paper on this topic in 2013 (way to play hard to get, Bob – not!).


I skipped the next presentation to read it (and to write this post). If you are at all interested in multinomial and ordinal logistic regression, you should, too. You can find it in the SAS Global Forum 2013 proceedings: http://support.sas.com/resources/papers/proceedings13/446-2013.pdf

It’s an outstanding paper and I am going to require it for my course next year. I think the students will find it far more accessible than some of the readings we have been using. They don’t complain loudly, but I know, I know. 

Tip #2 Read the Documentation (No, seriously, keep reading)

People who answer comments with LMGTFY (let me Google that for you) or RTFM (read the fucking manual), just so you know, that quit being funny around 1990. However, SAS documentation really is a treasure trove. It’s not just SAS, the same could be said about jQuery documentation or the WordPress Codex but we’re not talking about those today, are we? Please try to stay on topic. 

The SAS documentation runs many, many thousands of pages. It’s far better and more detailed than you would think. Let me give you an example a very helpful person named Michael pointed out in the Quad (what the hell is it with that name?) today. As I’ve mentioned several times lately, my students often struggle with repeated measures ANOVA. He suggested checking out the page on longitudinal data analysis.

http://support.sas.com/rnd/app/stat/procedures/LongitudinalAnalysis.html

It gives four different procedures (none of which are GLM, I noted, but that’s a discussion for another day). 

Related to that, I recommend when you are learning procedures just running some of the code examples. For example, here is one for repeated measures with PROC MIXED. http://documentation.sas.com/?docsetId=statug&docsetTarget=statug_mixed_examples02.htm&docsetVersion=15.1&locale=en. (Yes, I really do have that on my mind lately)

Think about this, though. Once you graduate from whatever your last degree turns out to be, you don’t have anyone checking your work and telling you if it is right or not. You just write your code and hope for the best. That sucks, huh?

When you are learning a new procedure, you can write code using the data shown in the SAS documentation and see if your results match. Like an answer key for life! I always wanted one of those.

Since the last few posts detailed errors in repeated measures with PROC GLM , I thought I should acknowledge that people seem to struggle just as much with PROC MIXED.

Forgetting data needs to be multiple rows

This is one of the first points of confusion for students. When you do a PROC MIXED, you need multiple records for each person. So, thinking back to my previous example with three time points and two options for treatment, with PROC MIXED my dataset needs to look like this:

Subject  Exam    Treatment  Score
1        Pre     Talk       43
1        Post    Talk       46
1        Follow  Talk       45
2        Pre     Drug       39

With GLM, you’d have 3 variables, named Pre, Post and Follow, for example (you *did* read the last post, right?). In PROC MIXED, your dataset has to be structured so that you have one variable, in this case, named “exam”, and it takes on one of three possible values.
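If your data are still in the GLM layout, a quick DATA step restructures them. A sketch, assuming the wide data set has variables subject, treatment, pre, post and follow:

```sas
/* Convert one record per subject (wide, for PROC GLM) into
   three records per subject (long, as PROC MIXED requires). */
data long ;
   set wide ;
   length exam $ 6 ;
   exam = 'Pre'    ; score = pre    ; output ;
   exam = 'Post'   ; score = post   ; output ;
   exam = 'Follow' ; score = follow ; output ;
   keep subject treatment exam score ;
run ;
```

Each OUTPUT statement writes one record, so every subject ends up with one row per exam.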

Let’s start with the simplest case. I’d like to know, just like before, if there was a change from pre to post-test and if it was maintained at follow-up. In other words, my question is, “Was there a difference between the pretest and post-test and a difference between pretest and follow-up six months later?” I am not particularly interested in the post-test/follow-up difference as such. Here is one way to code it:

proc mixed data = example ;
   class subject exam ;
   model score = exam ;
   random subject ;
   contrast "pre vs post" exam 1 -1 0 ;
   contrast "pre vs follow" exam 1 0 -1 ;
run ;


The PROC GLM code from this post will give you the exact same results as the code above, but only if you have your data structured so that you have three variables instead of three records for each person.

I have a lot to say about CONTRAST statements, which I love, random effects, about which I am neutral, and nested effects, which are not relevant to this example but could be. However, I am trying not to work past 9 pm and it’s already an hour later, so … until next time.

Also, if you’re at SAS Global Forum, be sure to meet up and say “Hey!”

This is my day job …

Check it out. I make games that teach math, including, of course, statistics. Play AzTech: Meet the Maya – the only statistics game with Honduran fruit bats.

As I said in my last post, repeated measures ANOVA seems to be one of the procedures that confuses students the most. Let’s go through two ways to do an analysis correctly and the most common mistakes.

Our first example has people given an exam three times, a pretest, a posttest and a follow up and we want to see if the pretest differs from the other two time points.

proc glm data = example ;
   model pre post follow = / nouni ;
   repeated exams 3 contrast(1) / summary printm ;
run ;

Among other things, this will give you a table of Type III Sums of Squares that tells you that you have a significant difference across time. It will also give you contrasts between the first time point and each of the other two.

You can see all of the output produced here.

This is using PROC GLM and so it requires that you have multiple VARIABLES representing each of the multiple times you measured people. This is in contrast to PROC MIXED which requires multiple records for each subject. We’ll get into that another day.

One thing that throws people all the time: they ask, “Where did you get the exams variable?” In fact, I could have used any valid SAS name. It could have been “Nancy” instead of “exams” and that would have worked just as well. It’s a label we use for the factor measured multiple times. So, as counterintuitive as it sounds, there is NO variable named “exams” in your data set.
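If you don’t believe me, try it. This sketch, using the same hypothetical data set, produces exactly the same results as the code above:

```sas
/* "Nancy" is just a label for the within-subject factor.
   There is no variable named Nancy (or exams) in the data. */
proc glm data = example ;
   model pre post follow = / nouni ;
   repeated Nancy 3 contrast(1) / summary printm ;
run ;
```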

Let’s try a different example. This time, I have a treatment variable. I have administered two different treatments to my subjects. I want to see if treatment has any effect on improvement.

proc glm data = example ;
   class treatment ;
   model pre post follow = treatment / nouni ;
   repeated exams 3 / summary ;
run ;

The fixed effect does *not* go in your REPEATED statement

In this case, I do need a CLASS statement to specify my fixed effect of treatment. A really common mistake that students make is to code the REPEATED statement like this:

repeated treatment 3 /summary ; *WRONG! ;

It seems logical, right? Why would you use a completely made up name instead of one of your variables? If you think about it for a minute, though, treatment wasn’t repeated. Each subject only received one type of treatment.

When you are asking whether one group improved more than the other(s), what you are asking is, “Is there an interaction effect?” You can see by the table of Type III Sums of Squares produced below that there was no interaction effect.


A significant effect for the repeated measure does not mean your treatment worked!

A common mistake is to look at the significance for the repeated measure and, because a significant change was found between times 1 and 3, to say that the treatment had an effect. In fact, though, we can see by the non-significant interaction effect that there was no impact of treatment, because there was no difference in the change in exam scores across the levels of treatment.

There are a lot of other common mistakes but I need to go back to work so those will have to wait for another blog.

When I teach students how to use SAS to do a repeated measures Analysis of Variance, they almost seem like those crazy foreign language majors I knew in college who were learning Portuguese and Italian at the same time.

I teach how to do a repeated measures ANOVA using both PROC GLM and PROC MIXED. It seems very likely in their careers my students will run into both general linear models and mixed models. The problem is that they confuse the two and the result is buggy code.

Let’s start with mistakes in PROC GLM today. Next time we can discuss mistakes in PROC MIXED.

Let’s say I have the simplest possible analysis – I’ve given the same students a pre- and a post-test and want to see if there has been a significant increase from time one to time two.

This will work just fine:

proc glm data = mydata.fl_pre_post ;
   model pretest posttest = / nouni ;
   repeated time 2 ;
run ;

Coding the repeated statement like this will also work

repeated time 2 (1 2) ;

So will

repeated time ;

It almost seems as if anything or nothing after the variable name will work. That’s not true. First of all,

repeated time 2 (a b) ; IS WRONG

… and will give you an error – Syntax error, expecting one of the following: a numeric constant, a datetime constant.

“Levels gives the number of levels associated with the factor being defined. When there is only one within-subject factor, the number of levels is equal to the number of dependent variables. In this case, levels is optional. When more than one within-subject factor is defined, however, levels is required.”

– SAS 9.2 User’s Guide

So, this explains why you can be happily using your repeated statement without bothering to specify the number of levels for a factor and then one day it doesn’t work. WHY? Because now you have two within-subject factors and you need to specify the number of levels, but you don’t know that. This is why, when teaching, I always include the number of levels. It will never cause your program to fail, even if it is sometimes unnecessary.
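For the curious, here is the case where levels stops being optional: two within-subject factors. A sketch with hypothetical variable names, six scores per person measured at three times under two conditions (in the REPEATED statement, the first factor changes most slowly across the dependent variables):

```sas
/* Doubly repeated measures: time (3 levels) crossed with
   condition (2 levels). Both level counts are now required. */
proc glm data = example2 ;
   model t1c1 t1c2 t2c1 t2c2 t3c1 t3c2 = / nouni ;
   repeated time 3, condition 2 ;
run ;
```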

One more cool thing about the REPEATED statement for PROC GLM: you can do a planned contrast super easily. Let’s say I have given 3 tests: a pretest, a post-test and a follow-up. I want to compare the post-test and follow-up to the pretest.

proc glm data = mydata.fl_tests ;
   model pretest posttest follow = / nouni ;
   repeated test_time 3 contrast(1) / summary ;
run ;

What this will do is compare each of the other time points to the first one. A common mistake students make is to use a CONTRAST statement here with test_time. This will NOT work, although it will work with PROC MIXED, but that is a story for another day.

I was going to write more about reading JSON data but that will have to wait because I’m teaching a biostatistics class and I think this will be helpful to them.

What’s a codebook?

If you are using even a moderately complex data set, you will want a codebook. At a minimum, it will tell you the name of each variable, the type (character, numeric or date), a label (if it has one) and its position in the data set. It will also tell you the number of records and number of variables in the data set. In SAS, you can get all of this by running a PROC CONTENTS. (Also from PROC DATASETS, but we don’t cover that procedure in this class.)
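Running it is one line; for example, on the data set used below:

```sas
/* Print the "codebook light" for sashelp.heart: variable
   names, types, lengths, labels and positions. */
proc contents data = sashelp.heart ;
run ;
```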

So, for the sashelp.heart data set, for example, you would see:

Output from PROC CONTENTS

The variable AgeAtDeath is the 12th variable in the data set. It is numeric, with a length of 8 and the label for it is “Age At Death”. Because it is a numeric variable, if you try to use it for any character functions, like finding a substring, you will get an error. (A substring is a subset of a string, so ‘ABC’ is a substring of ‘ABCDE’.)

Similarly, BP_Status is the 15th variable in the data set, it is a character, with a length of 7 and a label of “Blood Pressure Status”. Because it’s a character variable, if you try to do any procedures or functions that expect numeric variables, like find the mean, you will get an error. The label will be used in output, like in the table below.

Frequency distribution of blood pressure status

This is useful because you may have no idea what BP_Status is supposed to mean. HOWEVER, if you use “Blood Pressure Status” in your statements like the example below, you will get an error.

**** WRONG!!!
Proc means data=sashelp.heart ;
Var blood pressure status ;

Seems unfair, but that’s the way it is.

The above statement will assume you want the means for three separate variables named “blood” “pressure” and “status”.

There are no variables in the data set named “blood” or “pressure” so you will get an error. There is a variable named “status”, but it’s something completely different, a variable telling if the subject is alive or dead.

Even if you don’t have a real codebook available, you should at a minimum start any analysis by doing a PROC CONTENTS so you have the correct variable names and types.

What about these errors I was talking about, though? Where will you see them?

LOOK AT YOUR SAS LOG!!

If you are using SAS Studio , it’s the second tab in the middle window, to the right of the tab that says CODE.

Click on that tab and if you have any SYNTAX errors, they will conveniently show up in red.

Also, if you are taking a course and want help from your professor or a classmate, the easiest way for them to help you is if you copy and paste your SAS log into an email or, even better, download it and send it as an attachment.

Just because you have no errors in the SAS log doesn’t mean everything is all good, but it’s always the first place you should look.

To get a table of blood pressure status, you may have typed something like

Proc freq data=sashelp.heart ;
Tables status ;

That will run without errors but it will give you a table that gives status as alive or dead, not blood pressure as high, normal or optimal.
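The variable you actually wanted, per the codebook, is BP_Status:

```sas
/* BP_Status, not Status, is the blood pressure variable. */
proc freq data = sashelp.heart ;
   tables bp_status ;
run ;
```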

PROC CONTENTS is a sort of “codebook light”. A real codebook should also include the mean, minimum, maximum and more for each variable. We’ll talk about that in the next post. Or, who knows, maybe I’ll finally finish talking about reading in JSON data.

When I was young and knew everything, I would frequently see procedures or statistics and think, “When am I ever going to use THAT?” That was my thought when I learned about this new procedure to transpose a data set. (It was new then. Keep in mind, I learned SAS when I was pregnant with my first child. She is now CEO of an educational game company and the mother of three children.)

PROC TRANSPOSE is super-useful. You might think it is only useful for restructuring data from the PROC GLM layout into the PROC MIXED layout, or you might have no idea what the hell that means, and it is still super-useful.

Let me give you today’s example. I’m looking for data to use in a biostatistics class I’m teaching next month. It’s a small data set, with data on eight states included in the Centers for Disease Control and Prevention’s Autism and Developmental Disabilities Monitoring Network.

The data looks like this:

As you can see, each state is a column. I would like to know, for example, what percentage of people with autism also have a physical disability. There is a way to do it by finding the mean across variables but I want to use this data set for a few examples and it would be much easier for me if each of those categories was a variable.

The code is super simple:

PROC TRANSPOSE DATA=mydata.autism OUT=mydata.autism2 NAME=state;
   ID eligibility ;
RUN;

The NAME= option is not required, nor is the ID statement, but they will make your life easier. First, let’s take a look at our new data.

Data set with one record for each state

Now, instead of state being a variable, we have one record for each state; the percent with autism diagnosis only is one variable, percent with emotional disturbance another, and so on. What the NAME= option does is give a name to that new variable, which holds what used to be the column names. If you don’t use that option, the first column would be named _name_. With these data it would still be pretty obvious that this variable is the state, but in some cases it wouldn’t be obvious at all.

The ID statement is really necessary in this case because otherwise each column is going to be named “COL1”, “COL2”, etc. Personally, I found the ID statement here confusing because I normally think of the ID statement as giving the individual ID for each record, like a social security number or student ID. In this case, the values of the variable you give in the ID statement are going to be used to name the new variables. So, as you can see above, the first column is named Autism (%), the second is named Emotional Disturbance (%) and so on.

So, that’s it. All I need to do to get means, standard deviation, minimum and maximum is :

PROC MEANS DATA=mydata.autism2 ;
RUN;


By the way, I get this data set and a few others from SAS Curriculum Pathways. Nice source for small data sets to start off a course.


I live in opposite world, where my day job is making games and I teach statistics and write about programming for fun. You can check out our games here. You’re probably already pretty good with division, but you’ll learn about the Lakota language and culture with Making Camp Lakota, a bilingual (English-Lakota) game that teaches math.


Some people believe you can say anything with statistics. I don’t believe that is true, unless you flat out lie, but if you are a big fat liar, I am sure you would lie just as much without statistics.

However, a point was made today when Marshall and I were discussing, via email, our presentation for the National Indian Education Association. One point we made was that, while most vocational rehabilitation projects serve relatively few youth, the number at Spirit Lake has risen dramatically. He said,

You said the percentage of youth increased from 2% to 20% and then you said the percentage of youth served tripled. Which was it?

It depends on how you slice your data

There is more decision-making in even basic statistics than most people realize. We are looking at a pretty basic question here: “Did the percentage of the caseload that was youth, age 25 and under, increase?”

The first question is, “Increase from when to when?”  That is, what year is the cutoff? In this case, that answer is easy. We had observed that the percentage of youth served was decreasing and changes were undertaken in 2015 to reduce that trend. So, the decision was to compare 2015 and later with 2014 and earlier.

Percent of Youth on Caseload, by Year

How much of an increase is found depends on the year used for comparison and whether we use one year or an average.

The discrepancy between the 10x improvement versus 3x comes because the percentage of youth served by the project varied from year to year, although the overall trend was going down. If we wanted to make ourselves look really good, we could compare the lowest year – 2013 at 2% with the highest year, 2015 at 20% and say the increase was 10x, but I think that isn’t the best representation, although it is true. One reason is that the changes we discussed in the paper weren’t implemented until 2015, so there is no justification for using 2013 as the basis.

The second question is how you compute the baseline. At first, I used all of the data from 2008-2014 as the baseline; youth comprised 7% of the new cases added. If we compare that to 2015, at 20.2%, the percentage of youth served almost tripled.

However, we only started using the current database system in the 2012 fiscal year, and the only people from prior years in the data were those who had been enrolled prior to 2012 and were still receiving services. The further back in time we went, the fewer people there were in the system, and they were definitely a non-representative sample. Typically, people don’t continue receiving vocational rehabilitation services for three or four years.

You can see the numbers by year below. The 2018 figure is only through June of this year, which is when I took a snapshot of the database.

If we use 2013-2014 as the baseline, the percentage of youth among those served was 4%. If we use 2012-2014, it's 6%.

To me, it makes more sense to compare against an aggregate over a few years. I averaged 2012 through 2014 because it gave a larger sample size, the data were representative, and I didn't feel comfortable using the absolute lowest year as a baseline. Maybe it was just a bad year. As any good psychometrician knows, the more data points you have, the more reliable your measure.

The third question is how to select the years for comparison. I combined 2015-2018, again because it gave a larger sample size and I did not want to just pick the best year as the comparison. Over that period, 18% of those served by the project were youth.
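For SAS users, the whole comparison can be sketched in a few lines. Everything here is hypothetical: assume a data set named CASES with one record per new case added and variables FISCAL_YEAR and AGE (these names are placeholders, not the real project data).

```sas
/* Sketch only: CASES, FISCAL_YEAR and AGE are hypothetical names,
   standing in for one record per new case added to the caseload. */
DATA youth_pct ;
    SET cases ;
    WHERE fiscal_year BETWEEN 2012 AND 2018 ;
    youth = (age <= 25) ;                      /* 1 = youth, 0 = not   */
    LENGTH period $ 22 ;
    IF fiscal_year <= 2014 THEN period = 'Baseline (2012-2014)' ;
    ELSE period = 'Comparison (2015-2018)' ;   /* the 2015 cutoff      */
RUN ;

/* Row percentages give the percent of youth within each period */
PROC FREQ DATA=youth_pct ;
    TABLES period * youth / NOCOL NOPERCENT ;
RUN ;
```

Changing the WHERE clause and the cutoff in the IF statement reproduces any of the baseline and comparison permutations discussed here.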

So … what have we learned? Depending on how you select the baseline and comparison years, we have improved 10 times (from 2% to 20%), 2.6 times (from 7% to 18%), tripled (from 6% to 18%) or quintupled (from 4% to 20%), and there are other permutations possible as well.

Notice something here, though. No matter how we slice it, after 2014, the percentage of youth increased, and substantially so. This increase was maintained year after year. 

I thought this was an interesting example of coming up with varying answers in terms of the specific statistic but, no matter what, arriving at the same conclusion: the changes in outreach and recruitment had a substantial impact.

I’m sure I’ve written about this before – after all, I’ve been writing this blog for 10 years – but here’s something I’ve been thinking about:

Most students don’t graduate with nearly enough experience with real data.

You can use government websites with de-identified data from surveys, and I do, but I teach primarily engineering and business students so it would be helpful to have some business data, too. Unfortunately, businesses aren’t lining up to hand me their financial, inventory and manufacturing data (bunch of jerks!)

So, I downloaded a free app, Medica Scientific, from the app store and ran a simulation of data for a medical device company. Some friends did the same, which gave me 4 data sets, as if from 4 different companies.

Now that I have 4 Excel files with the data, before you get to uploading a file, I'm going to give you a tip. By default, SAS imports the first worksheet. So, move the worksheet you want to be first; in this case, it's a worksheet named “Financials”. Since SAS will use the first worksheet regardless of its name, it could just as well be named “A whale ate my sandwich”, but that wouldn't be as obvious.
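If you would rather not rearrange the workbook, PROC IMPORT also lets you name the worksheet directly with a SHEET statement. A quick sketch (the path and file name simply mirror the example used in this post):

```sas
/* Import a specific worksheet by name instead of relying on its order */
PROC IMPORT DATAFILE='/home/annmaria/examples/simulation/Tech2Demo.xlsx'
    DBMS=XLSX
    OUT=WORK.import1
    REPLACE ;
    SHEET='Financials' ;
    GETNAMES=YES ;
RUN ;
```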

While you are at it, take a look at the data, with the variable names in the first row. ALWAYS give your data at least a cursory glance. If it is millions of records, opening the file isn't feasible, and we cover other ‘quick looks’ in class.

These steps and the next few use SAS Studio, which is super-duper helpful for online courses.

1. Upload the file into the desired directory
2. Under Tasks and Utilities select Utilities and then Import Data
3. Click select file and then navigate to the folder where your file is and click open
4. You’ll see a bunch of code but nothing actually happens until you click on the little running guy.

[Screenshot: menus to select data to import]

First, select the data set.

[Screenshot: the Import Data window]

Have you clicked the running guy? Good!

Okay, now you have your code. Not only has SAS imported your data file into SAS, it’s also written the code for you.

FILENAME REFFILE '/home/annmaria/examples/simulation/Tech2Demo.xlsx';

PROC IMPORT DATAFILE=REFFILE
    DBMS=XLSX
    OUT=WORK.IMPORT1;
    GETNAMES=YES;
RUN;

PROC CONTENTS DATA=WORK.IMPORT1;
RUN;

Now, if you had a nice professor who only gave you one data set, you would be done, which is why I showed you the easy way to do it.

However, very often, we want to compare several factories or departments or whatever it is.

Also, life comes with problems. Sigh.

One of your problems, which you'd notice if you opened the data set, is that the variables have names like “Simulation Day”. I don't want spaces in my variable names.

My second problem is that I need to upload all of my files and concatenate them so I have one long file.

Let’s attack both of these at once. First, upload the rest of your files.

Now,  open a new SAS program and at the top of your file, put this:

OPTIONS VALIDVARNAME=V7 ;

It will make life easier in general if your variable names don't have spaces in them. The option above tells SAS to follow version 7 naming rules, so PROC IMPORT automatically converts the column headings to valid variable names without spaces; for example, “Simulation Day” becomes Simulation_Day.
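You can verify the renaming with PROC CONTENTS after the import runs; the data set name below assumes the WORK.IMPORT1 output from the earlier import step.

```sas
OPTIONS VALIDVARNAME=V7 ;  /* convert headings to valid V7 names */

/* After your PROC IMPORT has run, list the (renamed) variables */
PROC CONTENTS DATA=WORK.import1 ;
RUN ;
```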

Now, to import all four files, just create a new SAS program and copy and paste the code created by your IMPORT procedure FOUR TIMES (yes, four), changing the file name and the output data set name each time.

From Captain Obvious:

Captain Obvious wearing her obvious hat

Although you’d think this would be obvious, experience has shown that I need to say it.

  • Do NOT copy the code in this blog post. Copy the code produced by your own IMPORT procedure; it will have your own directory name.
  • Do NOT name every output data set IMPORT1, because if you do, each step will replace the previous data set and you will end up with one data set and be sad.

Since I want to replace the first file, I’m going to need to add the REPLACE option in the first PROC IMPORT statement.

OPTIONS VALIDVARNAME=V7 ;

FILENAME REFFILE '/home/annmaria/examples/simulation/Tech2Demo.xlsx';

PROC IMPORT DATAFILE=REFFILE
    DBMS=XLSX
    REPLACE
    OUT=WORK.IMPORT1;
    GETNAMES=YES;
RUN;

PROC CONTENTS DATA=WORK.IMPORT1;
RUN;

FILENAME REFFILE '/home/annmaria/examples/simulation/Tech2Demo2.xlsx';

PROC IMPORT DATAFILE=REFFILE
    DBMS=XLSX
    REPLACE
    OUT=WORK.IMPORT2;
    GETNAMES=YES;
RUN;

PROC CONTENTS DATA=WORK.IMPORT2;
RUN;

Do that two more times for the last two data sets, changing the file name and the output data set name (IMPORT3, then IMPORT4) each time.

Did you need to use the utility? Couldn't you just have written the code from the beginning? Yes. I just wanted to show you that the utility exists. If you only had one file and it had valid variable names, which is a very common situation, you would be done at that point.

In a real-life scenario, you would want to concatenate all of these into one file so you could compare clinics, plants, whatever. Super easy.

[IF you have write access to a directory, you could create a permanent data set here using a LIBNAME statement, but I'm going to assume that you are a student and you do not. The default is to write to the temporary WORK library. ] ;

DATA allplants ;
    set import1 - import4 ;
run ;
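One refinement worth knowing about: since you are stacking four files as if from four companies, you will usually want a variable identifying the source of each row. The SET statement's INDSNAME= option (SAS 9.2 and later) captures the name of the contributing data set; the data set names below match the import steps above.

```sas
/* Stack the four imported files and record where each row came from.
   INDSNAME= puts the source name (e.g. WORK.IMPORT3) into a temporary
   variable, which we copy into a permanent one. */
DATA allplants ;
    LENGTH source $ 41 ;
    SET import1 - import4 INDSNAME = dsn ;
    source = dsn ;
RUN ;
```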

IF you get an error at this point, what should you do?

There are a few different answers to that question and I will answer them in my next post.

SUPPORT MY DAY JOB . IT’S FUN AND FREE!
YOU CAN DOWNLOAD A SPIRIT LAKE DEMO FOR YOUR WINDOWS COMPUTER FROM THE MICROSOFT STORE
