Finishing up my second paper for WUSS next month and I have been thinking about the usefulness of character functions in a world where it sometimes seems like everyone is just put on this earth to irritate the hell out of me.
Take this problem, for example,
In analyzing the data for our games, we have all sorts of beta testers – teachers, staff, interns – who played the game but their data should be deleted from the datasets for the annual report. We asked them to use the word TEST in their username so it would be easy to pull them from the data. Some of them did and some apparently feel that I just say these things of exercise for my mouth.
There is also a problem with data entry errors. The subjects in this study were children in grades three through six and they frequently mistyped their usernames.
SAS has a wealth of character functions and this is a first opportunity to get to know and love four of them.
The UPCASE function, not surprisingly, changes the value of a variable to upper case. The COMPRESS function, if you give it only the variable as an argument, will remove blanks from a value. You can, however, include additional characters to remove. Since many of the students entered their names on some days as JohnDoe and others as John.Doe , we are removing both blanks and periods using the COMPRESS function, after we have converted them to upper case.
username = COMPRESS(UPCASE(username),'. ') ;
Then there is the INDEX function. Here is a general tip. Any time you find yourself thinking,
“Gee it would be nice if SAS did thing X”,
it is a pretty good bet that someone else thought the same idea and there is a function for it. The INDEX function is a perfect example of that. Our testers played the games many, many times and used usernames like “tester1”, “this.test”, “skippy the tester” or “intern7”.
“Wouldn’t it be nice if there was way to find out whether a given string appeared anywhere in a value?”
Enter the INDEX function, which does exactly that. This function is case-sensitive, but since we already converted the username to upper case above, that is no problem for us.
IF INDEX(username, “TEST”) > 0 or INDEX(username,”INTERN”) > 0 THEN DELETE ;
will do exactly what we want. The INDEX function returns a number that is the starting position in the string of the substring we are trying to find. So, in “skippy the tester”, the value is 12, in “tester1” it is 1. If the string is not found, the value is 0.
A problem I found when looking at the contents of each of the 8 datasets used for my research project was that the username variable was not the same length in all of them, which could cause problems later when they were to be merged together or concatenated. All of the usernames should have been a maximum of 12 characters but there were data entry problems when students would type mister_rogers instead of mr_rogers.
When the data are read in using PROC IMPORT, “For delimited files, the first 20 rows are scanned to determine the variable attributes. You can increase the number of rows scanned by using the GUESSINGROWS data source statement.”
Wouldn’t it be nice if there was a way to just get the first n characters of a string?
newid = SUBSTR(username, 1, 12) ;
will create a new variable with the first 12 characters of the username, now that we have gone and fixed the problems with it.
SAS is chock full of functions and options to make your life easier. If you are just beginning to work with SAS and you spend time working with messy data, you probably couldn’t spend your time much better than taking a few hours to read up on SAS character functions. In fact, I think for someone new to SAS, becoming familiar with a large number of all types of functions – character, statistical, date and time – is probably the fastest way to improve one’s productivity. (Ron Cody’s book, SAS Functions by Example, is a great resource). I’ve lost count of the number of times when reviewing a student’s program I’ve seen many lines of completely unnecessary code that could have been replaced by a SAS function – if the student only knew that it existed.