There is not nearly enough replication in scientific research. It’s unfortunate that funding agencies and academic journals always want to see a new twist – a different technique, a different population. Personally, I’m very interested in reading studies that say:
“I did the exact same study as Mary Lou Who and I found pretty much the same thing.”
One reason this is interesting is that it controls for the history effect. Maybe a specific event determined the outcome. A second reason I find replication interesting is that people are very quick to generate causal hypotheses to explain relationships after the fact. In a subsequent study, those hypotheses can be assessed. Do they still stand up?
Here is an example that comes up in my personal life a lot. People assume since Darling Daughter Number 3 is on TV and in the movies a lot that it helps my business.
Let’s take a look at the graph below:
This shows website statistics for The Julia Group site. Those lines are average daily visits to this site in months when my little pumpkin had UFC world title fights. I used average daily visits to control for the fact that some months have more days than others. Contrary to expectations, in the months when she had fights I had a stagnant or declining number of visitors. Hearing this, some of the same people who had suggested her career would have a positive effect on business reversed themselves without blinking an eye and said it must be because I was distracted and away from the office during those months.
Let’s replicate that graph with data from 2012-2013. You see a pretty similar trend between the top and bottom lines. Over the past couple of years, visits have been rising, so the average number of daily visitors is higher than in 2012-13, but the pattern is the same – an increase during the months from September to December and fewer visits in the summer months. December 2012 was a little unusual compared to most years – usually there is a drop over the holidays.
Because I see these same trends year after year, I realize it’s not at all attributable to how much Ronda is in the media in a given month. It’s a seasonal trend. Since I write about statistics and programming a lot, I’m pretty sure more people come to this blog during the academic year when they are taking a class. Also, people can read my blog at work and pretend it is work-related, even if I’m just ranting about something that day, because, hey there is a possibility that it COULD be about something relevant.
This assumption is further supported by the fact that the lowest days of the week for website visits are Saturday and Sunday.
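If you want to run the same check on your own site, here is a minimal sketch of how it might be done in SAS. The data set name daily_visits, its variable names and every number in it are made up for illustration; they are not my actual site statistics.

```sas
/* Hypothetical daily log: one row per day. All numbers are made up. */
data daily_visits;
    input date :yymmdd10. visits;
    /* First day of the month, used as the grouping value */
    month = intnx('month', date, 0);
    /* Day of the week: 1 = Sunday ... 7 = Saturday */
    dow = weekday(date);
    format date yymmdd10. month yymmn6.;
    datalines;
2013-01-30 410
2013-01-31 395
2013-02-01 420
2013-02-02 250
2013-02-03 240
;
run;

/* Mean of the daily counts within each month. Because it is an average
   over days, a 31-day month is not favored over a 28-day one. */
proc means data=daily_visits n mean maxdec=1;
    class month;
    var visits;
run;

/* Same idea by day of the week, to see whether weekends are lower */
proc means data=daily_visits n mean maxdec=1;
    class dow;
    var visits;
run;
```

Running the same two steps on each year's data separately is the replication: if the seasonal shape shows up again, that is one more point against the explanation of the month.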
It’s also interesting when you don’t find the same thing
If one defines “interesting” as not getting what you want, I had an interesting experience with a research project recently. Replicating the project a second year, we ran into all kinds of technical difficulties and the results were far from significant. In short, the subjects did not receive the planned intervention so no effect of intervention was observed. Much swearing ensued. I’m now analyzing data from the third year of the same project.
Multi-year studies make so much more sense to me and it troubles me that there are not more of them. I understand the reasons. For one thing, there is so much pressure to publish in many institutions that people put out as many articles as they can as quickly as they can (everyone except for YOU, of course). For another, multi-year studies are expensive and it is hard to justify funding to study something you have supposedly already studied and reported on.
Yeah, I get it, but just like those people who confidently explain my website statistics, without replication it is too easy to be persuaded that one’s first, or completely contradictory second, hypothesis is correct.
I’m giving a talk on Preparing Students for the Real World of Data at SAS Global Forum next month.
You’d think 50 minutes would be long enough for me to talk, but that just goes to show you don’t know me as well as you think you do. One point made in the template for papers is that you should not try to tell every single thing you know about the DATA step, for example, because it will bore your audience to death.
Random Tips That Didn’t Make it Into the Paper
1. CATS removes blanks and concatenates
While I did give a few shout outs to character functions, it was not possible to put in every function that is worth mentioning. One that didn’t make the cut is the CATS function.
The CATS function concatenates strings, removing all leading and trailing blanks.
Let’s say that I want to have each category renamed with a leading “F” to distinguish all of the variables from the Fish Lake game. I also want to add a ‘_’ to problems 10-14 so that when I chart the variables 11 comes just before 12, not before 2 (which is what would happen in alphabetical order). So, I include these statements in my DATA step.
IF problem_num IN(11,10,12,13,14) THEN probname = CATS('F','_',probname);
ELSE probname = CATS(game,probname) ;
Now when I chart the results you can see the drop off in correct answers as the game gets more difficult.
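Here is a small, self-contained sketch of the same idea that you can paste in and run. The problem numbers are made up, I write to a new variable named newname (my own choice, so the original stays intact) and I use the literal 'F' where the statements above use the game variable.

```sas
/* Made-up problem names. A character variable is padded with trailing
   blanks up to its length, and CATS strips those before joining. */
data problems;
    length probname $ 8 newname $ 12;
    input problem_num probname $;
    if problem_num in (10, 11, 12, 13, 14)
        then newname = cats('F', '_', probname);
        else newname = cats('F', probname);
    datalines;
11 11
2 2
12 12
9 9
;
run;

/* In ASCII the underscore sorts after the digits, so F2 and F9
   come before F_11 and F_12 */
proc sort data=problems;
    by newname;
run;

proc print data=problems noobs;
    var problem_num newname;
run;
```

I gave newname an explicit LENGTH on purpose. If the variable receiving the result of CATS has not been given a length before, SAS makes it 200 bytes, which is rarely what you want in a data set you plan to keep.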
2. Not all export files are created equal
Nine of the ten datasets I needed I was able to download as an EXCEL file and open up in SAS Enterprise Guide. It was a piece of cake, as I mentioned last time. Unfortunately, the third file was downloaded from a different site and it had special characters in it, like division signs, and the data had commas in the middle of it. When I opened it up in SAS Studio it looked like this.
Fixing it was actually super simple. This was an Excel file. I simply did a Replace ALL and changed the division signs to “DIV” and the commas to spaces. The whole thing took FIVE lines to read in after that.
3. Listen to Michelle Homes and know your data
filename fred "/courses/abc123add/sgf15/sl_pretest.csv" ;
Data pretest keyed;
LENGTH item9 $ 38 ;
infile fred firstobs = 2 dlm="," ;
input started $ ended $ username $ (item1 - item24) ($) ;
Thank you to the lovely Michelle Homes for catching this! As she pointed out in the comments, the input statement assumes that the variables are 8 characters in length and character data. This is true for 26 of the 27 variables. However, ONE of the 24 items on the test is a question that can be answered with something like Four million, four thousand and twelve.
That, as you can see, is over 8 characters. So, I added a LENGTH statement. That brought up another issue, but that is the next post …
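To see the problem Michelle caught for yourself, here is a minimal sketch using in-line data instead of the real file. The username and the data set names are made up, and I left the comma out of the answer since the comma is the delimiter.

```sas
/* Without a LENGTH statement, list input with $ creates an
   8-character variable, so the long answer is cut off */
data no_length;
    infile datalines dlm=',' truncover;
    input username $ item9 $;
    datalines;
student1,Four million four thousand and twelve
;
run;

/* Declaring the length before the INPUT statement keeps the whole answer */
data with_length;
    length item9 $ 38;
    infile datalines dlm=',' truncover;
    input username $ item9 $;
    datalines;
student1,Four million four thousand and twelve
;
run;

proc print data=no_length;
run;

proc print data=with_length;
run;
```

In the first listing item9 holds only the first 8 characters; in the second it holds the full answer. One side effect to be aware of: because the LENGTH statement comes first, item9 becomes the first variable in the data set.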
I’ll have a lot more to talk about in Dallas. Hope to see you there.
Want to be even smarter? Back us on Kickstarter! We make games that make you smarter. The latest one, Forgotten Trail, is going to be great! You can get cool prizes and great karma.
If you came into my office and watched me work today, just before I had you arrested for stalking me, you might notice me doing some things that are the absolute opposite of best practices.
I need about 10 datasets for some analyses I’ll be doing for my SAS Global Forum paper. I also want these data sets to be usable as examples of real data for courses I will teach in the future. While I’m at it, I could potentially use most of the same code for a research project.
The data are stored in an SQL database on our server. I could have accessed these in multiple ways but what I did was
1. Went into phpMyAdmin and chose EXPORT as ODS spreadsheet.
2. Opened the spreadsheet using Open Office, inserted a row at the top and manually typed the names of each variable.
Why the hell would I do that when there are a dozen more efficient ways to do it?
In the past, I have had problems with exporting files as CSV, even as Excel files. A lot of our data comes from children and adolescents who play our games in after-school programs. If they don’t feel like entering something, they skip it. That missing data has wreaked havoc in the past, with all of the columns ending up shifted over by 1 after record 374 and shifted over again after record 9,433. For whatever reason, Open Office does not have this problem and I’ve found that exporting the file as ODS, saving it as an xls file and then using the IMPORT DATA task or PROC IMPORT works flawlessly. The extra ODS > Excel step takes me about 30 seconds. I need to export an SQL database to SAS two or three times a year, so it is hard to justify troubleshooting the issue to save myself 90 seconds.
IF YOU DIDN’T KNOW, NOW YOU KNOW
You can export your whole database as an ODS spreadsheet. It will open with each table as a separate sheet. When you save that as an XLS file, the structure is preserved with individual sheets.
You can import your data into SAS Enterprise Guide using the IMPORT DATA task and select which sheet you want to import. Doing this 2, 3 or however-many-sheets-you-have times will give you that number of data sets.
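If you would rather use code than the point-and-click task, a sketch of the PROC IMPORT equivalent might look like this. The file path and the output data set name are placeholders I made up; the sheet name assumes each sheet is named after its table. DBMS=XLS needs SAS/ACCESS Interface to PC Files, which may or may not be part of your installation.

```sas
/* Read one sheet of the saved workbook into a SAS data set.
   The path and the names here are hypothetical. */
proc import datafile="/hypothetical/path/gamedata.xls"
    out=work.learn
    dbms=xls
    replace;
    sheet="learn";      /* which sheet (table) to read */
    getnames=yes;       /* the first row holds the variable names typed in by hand */
run;
```

Repeat the step with a different SHEET= and OUT= for each table you need.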
WHY TYPE IN THE VARIABLE NAMES?
Let me remind you of Eagleson’s law
“Any code of your own that you haven’t looked at for six or more months might as well have been written by someone else.”
It has been a few months since I needed to look at the database structure. I don’t remember the name of every table, what each one does or all of the variables. Going through each sheet and typing in variable names to match the ones in the table is far quicker than reading through a codebook and comparing it to each column. I’ll also remember it better.
If I do this two or three times a year, though, wouldn’t using a DATA step be a time saver in the long run? If you think that, back up a few lines and re-read Eagleson’s law. I’ll wait.
Reading and understanding a data step I’d written would probably only take me 30 seconds. Remembering what is in each of those tables and variables would take me a lot longer.
I’ve already found one table that I had completely forgotten. When a student reads the hint, the problem number, username and whether the problem was correctly answered are written to a table named learn. I can compare the percentage correct from this dataset with the rest of the total answers file, of which it is a subset. Several other potential analyses spring to mind – on which questions are students most likely to use a hint? Do certain students ask for a hint every time while others never do?
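A first pass at those two questions could be as simple as the sketch below. The variable names (username, problem_num, correct) are my guess at reasonable names, not necessarily the ones in the table, and the rows are invented.

```sas
/* Invented rows standing in for the learn table: one row each time
   a student read a hint; correct is 1 for a right answer, 0 otherwise */
data learn;
    input username $ problem_num correct;
    datalines;
anna 3 1
anna 5 0
ben 3 0
ben 3 1
cara 7 1
;
run;

/* Which problems draw the most hints, and what proportion of
   those attempts ended in a correct answer? */
proc means data=learn n mean maxdec=2;
    class problem_num;
    var correct;
run;

/* Do some students read a hint far more often than others? */
proc freq data=learn;
    tables username / nocum;
run;
```

Since correct is coded 0/1, its mean is the proportion correct, which is the number to set beside the same proportion from the full answers file.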
Looking at the pretest for Fish Lake, I had forgotten that many of the problems are two-part answers, because the answer is a fraction, so the numerator and denominator are recorded separately. This can be useful in analyzing the types of incorrect answers that students make.
The whole point of going through these two steps is that they cause me to pause, look at the data and reflect a little on what is in the database and why I wanted each of these variables when I created these tables a year or two ago. Altogether, it takes me less time than driving five miles in Los Angeles during rush hour.
This wouldn’t be a feasible method if I had 10,000,000 records in each table instead of 10,000 or 900 variables instead of 90, but I rather think if that was the case I’d be doing a whole heck of a lot of things differently.
My points, and I do have two, are
- Often when working with small and medium-sized data sets, which is what a lot of people do a lot of the time, we make things unnecessarily complicated
- No time spent getting to know your data is ever wasted
We were driving to the hospital to get some tests done and complaining about the traffic with Colorado Ave. closed for a couple of blocks for construction on the new train line. The Invisible Developer, brilliant, as usual, commented,
At least we have the luxury of worrying about everyday things.
He was right, of course. After a few hours in the hospital, this was even more evident. There are a thousand reminders of how lucky we are. In the bathrooms, there is a cord to pull in case you need assistance. Let that sink in for a moment – there are procedures in place just in case going in to use the restroom turns out to be beyond your physical capabilities.
Sorry to tell you, fellow citizens, but north Santa Monica is to Los Angeles like Florida is to the rest of the country – a place where old people go to die comfortably. This area must have the most people using walkers per square block outside of, well, Florida.
Everything turned out fine and by evening we were at The Fish Co with our granddaughter drinking Chardonnay and eating oysters (well, she was drinking milk and eating cherries).
Today was a sort of unproductive day. I worked on PHP code that did not work all day. By the end of the day, I had some ideas but nothing that actually ran. I worked on two different problems and didn’t solve either of them. Much swearing ensued. I cannot find the photoshop file for a piece of artwork anywhere and I need it modified. Our wonderful artist is on vacation in Peru for another week.
We are out of dishwasher soap and the housekeeper comes tomorrow so someone needs to go to the store and buy cleaning supplies.
Maria had a baby and is writing a book and has been unavailable for several months.
All of my problems are nothing.
The book Maria is writing is a memoir with my other daughter, Ronda, who has been quite successful. Maria is a brilliant writer and the book is selling well months prior to publication. If it doesn’t make the best-seller list, I will be shocked.
I have problems to solve because we have work. I live in an area with low crime, good weather and a good economy, which is why we have construction and traffic. People want to live here.
Years ago, when The Spoiled One was about 11 years old, she had an infection and there was a very brief period – about 24 hours – when she was in the hospital getting all sorts of tests, including for leukemia. Lots of very kind people tiptoed around us talking in hushed tones. It turned out to be nothing serious. We went home and back to the luxury of worrying about everyday things.
This week, she was accepted on a club soccer team, turned 17 years old, took her SAT and was awarded a scholarship (again) for her fourth and final year of a college prep school that she will appreciate much more once she actually goes to college.
We took a picture of her in the hospital and I keep it to remind me that it is a luxury to be able to worry about everyday problems.
As further proof that God has a sense of humor, my career has been full of reversals. Where I was once the pain-in-the-ass young hotshot who knew everything and thought my boss was stuck in the past century, now I have to deal with people like that.
For my first few years as an employee, I thought that managers were pretty much leeches on the productivity of the “real workers” like me.
How could they claim to be busy all of the time when they weren’t actually making anything?
These days, I have to fight to get an hour or two to actually write code, and yet, I often work 12-14 hour days.
What do CEOs do all day? Let me give you a not-so-brief list, not at all in order because it never is in order.
- Monitor budgets. I meet with our accountant, usually by phone, and review files she sends documenting where our expenditures are in comparison with budgets for each line item – supplies, travel, developer salaries, marketing expenses. It’s my job to see that we don’t run out of money. Because I am the owner of one company and CEO of a separate corporation, I make sure that expenses are apportioned to the right entity. I look over our corporate tax returns.
- Review contracts and documents. Speaking of tax returns, there are a number of documents – tax returns, federal reports, contracts for employees and freelancers, rental agreements – that bind the corporation in some manner and require the signature of someone with that authority, that being me. Because I am not an idiot, I read all of these before I sign on the dotted line.
- Answer questions requiring approval. Do we want to extend Joe’s contract as an animator/ software developer/ janitor ? If so, how much do we want to pay him? Should he get a raise? Has he done a really bad job this year and should we consider letting his contract lapse and replacing him? Do we want to continue paying for a license for Unity / Coherent UI / Adobe Creative Suite etc etc. Some of these discussions are very quick and some take an hour or more.
- Answer questions on priority. What do I want Mary to work on first? Is the new radio commercial more important than the video for the Kickstarter campaign? Should Sue document the module she just finished on the wiki before going on to the next part of the game or is our deadline just so tight that she needs to knock that level out immediately? Again, some of these discussions take a while. Is there someone else who can do the documentation while Sue goes back to the previous level and debugs that? Is there anyone on the project part-time that could work more hours?
- Calls and meetings with people who are very important to our company. These can be people who give us money, people who might give us money or representatives from schools that are our beta test sites. No matter what you do, there are people who you really want to keep happy because they are critical to your organization. You don’t want to take the chance that they will be given the wrong information or put off to tomorrow because the person they are meeting with doesn’t have the authority to make a commitment.
- Meetings with people within the company. We have meetings weekly or bi-weekly with staff just for communication. Everyone needs to know what repository we are using for the latest game, who is in charge of starting the section of the wiki for that, who is doing the artwork and where it is stored and dozens of other things. Yes, maybe we could send out email or create a Google doc, but a meeting ensures that as of noon on Monday everyone knew all of these things.
- Applying for money. I spend probably 20% of my time on this. Some days it is 0% and other days it’s 100%. This may be grant writing or attending a meeting with an investor to determine if this is a good fit for us.
- Being the public face of your company. This can be presenting at a conference, doing an interview with the press or being a guest speaker at a meeting. If you are a start-up, your biggest competitor is apathy. Any way you can increase awareness that you exist is time well spent.
- Administrivia. This is my name for all of the stuff that somehow collects and needs to be done. Email from people I met who I may or may not want to respond to and ever meet again – but I need to read it. Invitations to present at some conference, contract offers I may want to decline. Most of these things I can glance at and delete, but I get hundreds of emails a day. Over the past couple of months, I have brought my unread emails down from 1,600 to under 1,000. In-box zero, here I come!
- Questions no one else seems to be able to answer. What’s our EIN number? Are we a C-corp or an S-corp? What’s the password for our SAM account?
Multiply each of these by a dozen times and you see why I’m writing this blog at 3 a.m.
Some people may have said that hackathons are a stupid ass idea where a bunch of people who can’t afford to buy their own pizza spend 48 hours with a bunch of strangers and no showers.
Okay, well, maybe that was me.
I take it all back.
We kicked off our hackathon at noon on Monday and wrapped up at 8 pm on Tuesday. The rules were simple – everyone who was working those days was to wipe their schedule completely for 8 hours each day and do nothing but work on the game. No emails, no blog posts, no meetings except for a kick off meeting each day to assign and review tasks. Jessica, Dennis, Samantha and I worked on the game for (at least) 16 hours. Any emails or interviews got done before the hackathon hours or after they were over. (I did pause for a brief interview with the Bismarck State College paper.)
Maria came in from maternity leave and worked 8 hours on Monday, baby in tow.
Gonzalo and Eric each worked their regular shifts on Monday and Tuesday, respectively, doing nothing but writing code, creating sprites and editing audio. Sam even pitched in a few hours early in the morning from Canada. Our massively talented artist, Justin, completed all of the new artwork before the meeting so we had it in hand to drop into all of the spots where there had been placeholders.
So, in two days a total of 100 hours were devoted just to game development. We made a giant leap forward.
Why did it work so well? For one thing, we were all in the same spot for a long time. Although the original plan was to meet and then people would go their separate ways, on Monday, five of the six people working stayed at my house. Three of us even slept there. That had two positive impacts.
First of all, whenever anyone needed something, whether it was a piece of artwork modified or a question answered on whether we had a sound file of footsteps in the woods or to be shown how to do a voice over in iMovie, there was someone else to provide that assistance right on the spot. Very often, you can spend hours searching for something on Google, watching youtube videos, reading manuals trying to figure out how to do X when someone else can come up and say – Click on Window, pick record voiceover, click on the microphone in the middle of the left side of the window.
There are also those questions that CANNOT be found on Google, like where the hell was the new background image saved and what is it called.
The second positive impact was we got around to tasks that had needed doing for a long time. While it may have seemed it kept us from making real progress on the game, the fifth time Sourcetree complained about not tracking those damned Dreamweaver .idea files, I HAD it and we removed those from the repository forever. When something bugs you every now and then you may think, “I’ll do it later”, but the fifth time it happens that day …
Anyway, I would share more of the awesomeness of the hackathon experience with you but it is now 9 pm and we are taking the team out for sushi.
In case you don’t know, SAS On-Demand is the FREE, as in free beer, offering of SAS for academic use. How good is it? There really can’t be one answer to that.
First of all, there are multiple options – SAS Studio, SAS Enterprise Miner, SAS Enterprise Guide, JMP, etc. so some may be better than others. I have a fair bit of experience with two of them, so let’s just look at one of those today.
I mostly use SAS Studio with my students and over the past few courses I have been really pleased with the results. I selected SAS Studio over Enterprise Guide because I strongly believe it is useful for students to learn to code and many students, yes, even in an area like biostatistics, need a little encouragement to learn. While they don’t end up expert SAS programmers after two or three courses, they at least can code a DATA step, read in raw data, aggregate data and data from external files, produce a variety of statistics and graphics and interpret the results.
Let’s be frank about this … it’s going to require a bit of work up front. You need to create a course with SAS On-Demand. You need to notify your students that they need to create accounts. If you are not going to use solely the sashelp directory data sets, you’re going to have to upload your own data.
Please don’t tell me you plan on solely using the sashelp data sets! These are really helpful for the first assignment or two while students get their feet wet but unless you expect your students to have careers where all of their files to be analyzed are going to be shipped with the software they use, you’re going to move to reading in other types of data sooner or later.
Your data are going to be stored on the SAS server (so you can tell people who ask that yes, you are ‘computing in the cloud’ – instead of what I usually tell people who ask stupid questions like that, which is shut the hell up and quit bothering me – but I digress. Even more than usual.)
No matter what software you use, you’re going to have to select some data sets for students to analyze, have some sort of codebook and make sure your data is reasonably clean (but not so clean that students won’t learn something about data quality problems). So, the only real additional time is figuring out how to get it on the SAS server.
None of these steps take much time, but adding them all up – getting a SAS profile, creating a course, creating an email to send to all of your students, with the correct LIBNAME, uploading your data – it all maybe adds up to a couple of extra hours.
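For the LIBNAME part, what you send your students can be as short as the sketch below. The path reuses the pattern of the course directory from my FILENAME statement in an earlier post, so treat it as a placeholder; your own course will have its own path. The libref sgf15 is just a name I picked.

```sas
/* Point a libref at the course directory on the SAS server.
   The path is a placeholder; substitute your own course path. */
libname sgf15 "/courses/abc123add/sgf15" access=readonly;

/* List the data sets in the library so students can see what is there */
proc contents data=sgf15._all_ nods;
run;
```

ACCESS=READONLY is there so nobody overwrites the shared course data by accident.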
My challenge always is how I shoehorn additional content into the very limited class time I have with students. One tool I’ve been using lately is livebinders. This is an application that lets you put together an online binder of web pages, videos and material you write yourself.
Here is an example of a livebinder I use for my graduate course in epidemiology. It has SAS assignments beginning with simply copying code to modifying it. Links to the relevant SAS documentation are included, as are videos that show step by step how to use SAS Studio for computing relative risk, population attributable risk, etc. I have a similar livebinder for my biostatistics course.
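To give you a flavor of the "copy this code, then modify it" kind of assignment, here is a sketch of a relative risk computation. It is not taken from the livebinder; the data set and the counts are invented for illustration.

```sas
/* Invented cohort counts: 100 exposed and 100 unexposed people */
data cohort;
    input exposed $ disease $ count;
    datalines;
Yes Yes 30
Yes No 70
No Yes 10
No No 90
;
run;

/* ORDER=DATA keeps Yes as the first row and first column, so the
   column 1 relative risk compares exposed with unexposed.
   WEIGHT tells PROC FREQ that each row stands for COUNT people. */
proc freq data=cohort order=data;
    weight count;
    tables exposed*disease / relrisk;
run;
```

With these counts the risk is 30/100 among the exposed and 10/100 among the unexposed, so the relative risk for column 1 should come out to 3. A natural modification for students is to change the counts and predict the answer before they run it.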
You might think this is a bit of hand-holding to walk the students through it, but I would disagree. Every time I have found myself thinking,
“Well, this is a little too easy”,
I have been wrong.
If you have been doing something for a decade or, in my case, a few decades, it’s hard to remember how confusing concepts were the very first time. Even things that you do automatically, like downloading your results as an HTML file, were a mystery at one time in your life. Making the videos takes some time initially – you have to do a screencast, and then the voice over. Sometimes I do them at once, using QuickTime and GarageBand simultaneously. Other times, I import the screencast into iMovie and record a voiceover.
Either way, a 7-minute video usually takes me half an hour to record, when you add in screwing up the first time, editing out the part where The Spoiled One came in and asked for money to go shopping, etc. So, you’re adding maybe 3-4 hours to the time you spend on your course. On the other hand, you only have to do it once, so, if you teach the same course a few times, it pays off. I cannot tell you how many times students tell me that the videos were helpful. Unlike when I am lecturing in class, they can slow the video down, play it over. Students end the course with experience coding, using data from actual studies and interpreting data to answer problems that matter.
My point is that it is a little more work to teach using SAS Studio, but it is worth it.