I know people who are so obsessive about testing and validating their code to the point they spend more time on testing it than actually writing it and analyzing the output. I said I know people like that, I didn’t say I was one of them. However, it is good practice to validate your SAS code and despite false rumors spread by my enemies, I do it sometimes.

Here is a simple example.  I believed that using the COMPRESS function with “l” for lower case or “I” for case-insensitive gave the same results. I wanted to test that. So, I ran two data steps

DATA USE_L;
set mydata.aztech_pre ;
q3 = compress(Q3,’ABCDEFGHIJKLMNOPQRSTUVWXYZ’,’l’);
q5 = compress(Q5,’ABCDEFGHIJKLMNOPQRSTUVWXY’,’l’);

… and a whole bunch more statements like that.

Then, I ran the exact same data step but with an “I” instead of an “l”  .

Finally, I ran a PROC COMPARE step

PROC COMPARE base =USE_L compare=USE_I ;
Title “Using l for lowercase vs I for insenstitive” ;

PROC COMPARE RESULTS SHOW NO DIFFERENCES

But, hey, maybe PROC COMPARE just doesn’t work. Is it really removing everything whether it is upper or lower case? To test this, I ran the procedure again comparing the dataset with the compressed results with the original data set.

PROC COMPARE base =mydata.aztech_pre compare=use_I ;
Title “Comparing with and without compress function” ;

The result was a whole lot of output, which I am not going to reproduce here, but some of the most relevant was:

  Values Comparison Summary                                                      
                                                                                                                                    
Number of Variables Compared with All Observations Equal: 24.                                     
 Number of Variables Compared with Some Observations Unequal: 16.                                  
Number of Variables with Missing Value Differences: 10.                                           
Total Number of Values which Compare Unequal: 694. 

Looking further in the results, I can see comparison of the results for each variable by observation number

          ||  q5                                                                              
           ||  Base Value           Compare Value                                              
       Obs ||  q5                    q5                                                        
 ________  ||  ____________          ____________                                              
            ||                                                                                  
         5  ||  150m                  150                                                       
         6  ||  42 miles              42                                                        
        10  ||  one thousand                                                                    
        12  ||  200 MILES             200       

So, I can see that the data step is doing what I want, which is removing all of the text from the responses and only leaving numbers. This is important because the next step is comparing the responses to the questions with the answer key and I don’t want any mismatches to occur because the student wrote ‘200 miles’ instead of 200.

In case you are interested, this is the pretest for two games that are used to teach fractions and statistics. You can find Aztech: The Story Begins here and play it for free, on your iPad , Mac, Windows or Chromebook computer.

Mayan god
Play Aztech !

Forgotten Trail can be played in a browser on any Mac, Windows or Chromebook computer.

Some people believe you can say anything with statistics. I don’t believe that is true, unless you flat out lie, but if you are a big fat liar, I am sure you would lie just as much without statistics.

However, a point was made today when Marshall and I were discussing, via email, our presentation for the National Indian Education Association. One point we made was, while most vocational rehabilitation projects serve relatively few youth, the number at Spirit Lake has risen dramatically. He said, 

You said the percentage of youth increased from 2% to 20% and then you said the percentage of youth served tripled. Which was it?

It depends on how you slice your data

There is more decision-making in even basic statistics than most people realize. We are looking at a pretty basic question here, “Did the percentage of the caseload that was youth age 25 and under, increase?”

The first question is, “Increase from when to when?”  That is, what year is the cutoff? In this case, that answer is easy. We had observed that the percentage of youth served was decreasing and changes were undertaken in 2015 to reduce that trend. So, the decision was to compare 2015 and later with 2014 and earlier.

Percent of Youth on Caseload, by Year

How much of an increase is found depends on the year used for comparison and whether we use one year or an average.

The discrepancy between the 10x improvement versus 3x comes because the percentage of youth served by the project varied from year to year, although the overall trend was going down. If we wanted to make ourselves look really good, we could compare the lowest year – 2013 at 2% with the highest year, 2015 at 20% and say the increase was 10x, but I think that isn’t the best representation, although it is true. One reason is that the changes we discussed in the paper weren’t implemented until 2015, so there is no justification for using 2013 as the basis.

The second question is how do you compute the baseline? If we use all of the data from 2008-2014 to get a baseline,  youth comprised  7% of the new cases added. At first, I used the previous year six years as baseline 2008-2014, we get 7% and if we compare that to 2015 with 20.2% the percentage of youth served almost tripled. 

However, we just started using the current database system in 2012 fiscal year and the only people from prior years in the data were those who had been enrolled prior to 2012 and still receiving services. The further back in time we went, the fewer people there were in the system, and they were definitely a non-representative sample. Typically, people don’t continue receiving vocational rehabilitation services for three or four years. 

You can see the number by year below. The 2018 figure is only through June of this year, which is when I took a snapshot of the database.

If we use 2013-2014  as a baseline, the percentage of youth among those served was 4%. If we use 2012-2014, it’s 6%. 

To me, it makes more sense to compare it to an aggregate over a few years.  I averaged 2012 through 2014 because it gave larger sample size, had representative data and also because I didn’t feel comfortable using the absolute lowest year as a baseline. Maybe it was just a bad year. As any good psychometrician knows, the more data points you have, the more reliable your measure. 

 The third question is how to select the years for comparison. I combined 2015-2018 also because it gave a larger sample size and, again,  I did not want to just pick the best year as a comparison. Over that period, 18% of those served by the project were youth.

So … what have we learned? Depending on how you select the baseline and comparison years we have either improved 10 times, from 2% to 20% , 2.6 times, from 7% to 18%,  tripled, from 6% to 18% , quadrupled, from 4% to 20% – and there are some other permutations possible as well.

Notice something here, though. No matter how we slice it, after 2014, the percentage of youth increased, and substantially so. This increase was maintained year after year. 

I thought this was an interesting example of being able to come up with varying answers in terms of the specific statistic but no matter what, you came to the same conclusion that the changes in outreach and recruitment had a substantial impact.