If you came into my office and watched me work today, just before I had you arrested for stalking me, you might notice me doing some things that are the absolute opposite of best practices.
I need about ten data sets for some analyses I’ll be doing for my SAS Global Forum paper. I also want these data sets to be usable as examples of real data in courses I teach in the future. While I’m at it, I could potentially reuse most of the same code for a research project.
The data are stored in an SQL database on our server. I could have accessed these data in any number of ways, but what I did was:
1. Went into phpMyAdmin and chose EXPORT as an ODS spreadsheet.
2. Opened the spreadsheet in OpenOffice, inserted a row at the top, and manually typed the name of each variable.
Why the hell would I do that when there are a dozen more efficient ways to do it?
In the past, I have had problems exporting files as CSV, and even as Excel files. A lot of our data comes from children and adolescents who play our games in after-school programs. If they don’t feel like entering something, they skip it. That missing data has wreaked havoc in the past, with all of the columns ending up shifted over by one after record 374 and shifted over again after record 9,433. For whatever reason, OpenOffice does not have this problem, and I’ve found that exporting the file as ODS, saving it as an XLS file, and then using the IMPORT DATA task or PROC IMPORT works flawlessly. The extra ODS > Excel step takes me about 30 seconds. I need to export an SQL database to SAS two or three times a year, so it is hard to justify troubleshooting the issue to save myself 90 seconds.
IF YOU DIDN’T KNOW, NOW YOU KNOW
You can export your whole database as an ODS spreadsheet. It will open with each table as a separate sheet. When you save that as an XLS file, the structure is preserved with individual sheets.
You can import your data into SAS Enterprise Guide using the IMPORT DATA task and select which sheet you want to import. Doing this two, three, or however-many-sheets-you-have times will give you that number of data sets.
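In code rather than through the task, importing one sheet might look something like this — the file path and sheet name below are placeholders for illustration, not the actual ones:

```sas
/* Import one sheet of the XLS file saved from the ODS export.        */
/* The path and sheet name are hypothetical - substitute your own.    */
proc import datafile="C:\data\game_data.xls"
    out=work.learn      /* SAS data set to create                     */
    dbms=xls            /* an Excel 97-2003 (.xls) file               */
    replace;            /* overwrite the data set if it exists        */
    sheet="learn";      /* which sheet of the workbook to read        */
    getnames=yes;       /* use row 1 (the typed names) as variables   */
run;
```

Repeating this with a different SHEET= value gives you one SAS data set per table.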
WHY TYPE IN THE VARIABLE NAMES?
Let me remind you of Eagleson’s law:
“Any code of your own that you haven’t looked at for six or more months might as well have been written by someone else.”
It has been a few months since I needed to look at the database structure. I don’t remember the name of every table, what each one does or all of the variables. Going through each sheet and typing in variable names to match the ones in the table is far quicker than reading through a codebook and comparing it to each column. I’ll also remember it better.
If I do this two or three times a year, though, wouldn’t using a DATA step be a time saver in the long run? If you think that, back up a few lines and re-read Eagleson’s law. I’ll wait.
Reading and understanding a DATA step I’d written would probably take me only 30 seconds. Remembering what is in each of those tables and variables would take me a lot longer.
I’ve already found one table that I had completely forgotten. When a student reads a hint, the problem number, username, and whether the problem was correctly answered are written to a table named learn. I can compare the percentage correct from this data set with the rest of the total answers file, of which it is a subset. Several other potential analyses spring to mind: on which questions are students most likely to use a hint? Do certain students ask for a hint every time while others never do?
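A sketch of that comparison in PROC SQL, assuming the imported data sets are named learn and answers and both have a 0/1 variable named correct — those names are my placeholders, not necessarily what is in the database:

```sas
/* Compare the percentage correct on hint-assisted problems (learn) */
/* with the percentage correct across all answers.                  */
/* Data set and variable names here are hypothetical.               */
proc sql;
    select "all answers" as source,
           mean(correct) as pct_correct format=percent8.1
        from answers
    union all
    select "with hint" as source,
           mean(correct) as pct_correct format=percent8.1
        from learn;
quit;
```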
Looking at the pretest for Fish Lake, I had forgotten that many of the problems have two-part answers: because the answer is a fraction, the numerator and denominator are recorded separately. This can be useful in analyzing the types of incorrect answers that students give.
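One way that analysis might start is a DATA step that classifies each answer by which part was missed. All of the data set and variable names below are hypothetical, just to show the idea:

```sas
/* Classify fraction answers by which part was wrong.                */
/* pretest, num_given, den_given, num_key, den_key are placeholders. */
data error_types;
    set pretest;
    length error_type $ 11;
    if num_given = num_key and den_given = den_key then error_type = "correct";
    else if num_given = num_key then error_type = "denominator";
    else if den_given = den_key then error_type = "numerator";
    else error_type = "both";
run;

proc freq data=error_types;
    tables error_type;
run;
```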
The whole point of going through these two steps is that they cause me to pause, look at the data and reflect a little on what is in the database and why I wanted each of these variables when I created these tables a year or two ago. Altogether, it takes me less time than driving five miles in Los Angeles during rush hour.
This wouldn’t be a feasible method if I had 10,000,000 records in each table instead of 10,000, or 900 variables instead of 90, but I rather think if that were the case I’d be doing a whole heck of a lot of things differently.
My points, and I do have two, are
- Often when working with small and medium-sized data sets, which is what a lot of people do a lot of the time, we make things unnecessarily complicated.
- No time spent getting to know your data is ever wasted.