# Statisticians need an Occam, or at least a razor

My initial exposure to Occam’s razor came in my first undergraduate economics class. Perhaps due to my tender years, it made a great impression, and I have tried to apply it ever since. In short, Occam’s razor advises us that, when presented with competing, plausible choices, the default choice should be the simplest one.

Plausible is a key word in that sentence. So, don’t go writing down the answer to every statistics homework problem as 42 and saying Dr. De Mars told you to use Occam’s razor.

When sorting through propensity score macros this weekend, I was struck by how much the solutions for one problem can vary in complexity.

The concept of propensity scores is not that difficult. You have two groups, people who went to the emergency room and people seen in urgent care centers, or people who visited a museum exhibit on the Holocaust and people who didn’t. They can be any two groups but the key point is this – they were not randomly assigned and they are not equivalent. If you want to decide whether visiting the Museum of Tolerance changes a person’s views on diversity, government or anything else, you need to consider that people who choose to visit the museum are probably different from people who don’t. Similarly, people who went to the ER are probably different in many ways from people who go to an urgent care center – severity of injury, income, insurance and so on.

So … you do some type of statistical analysis and get a score that takes all of these variables into account, called a propensity score. You try to match the two groups on this score as closely as possible. If you have 2,000 people who went to an ER and 12,000 who went to urgent care centers, you would select the 2,000 from urgent care who were the closest to the Emergency Room sample. (You can also select more than one match for each person, but remember, we are keeping it simple.)

Most people don’t have much trouble grasping the LOGIC of propensity score matching, but then the complexity begins. You can match on exact scores.

“We matched people who had the same score.”

This is my preferred method right out of the gate because it fits the AnnMaria Scale (my personal version of Occam’s razor) which is,

“Can you explain what you did in 25 words or less, without using any words the reader has to look up in a dictionary?”

(Be impressed that the AnnMaria Scale also fits the AnnMaria Scale.)

On my latest project, I used a modification of the custom macro published by Lori Parsons and presented at the SAS Users Group International (now known as SAS Global Forum). Her macro finds scores that match down to the fifth decimal place and spits those out into the final data set. Next, it pulls out scores that match down to the fourth decimal place, then to three decimal places, then two, then one.

If the propensity score isn’t even close – subject A has a score of .2 and subject B has a score of .8 and there are only these two subjects left without a match, then no match occurs.
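To make the first pass concrete, here is a minimal sketch of exact matching at five decimal places. It is NOT Lori Parsons’s macro, just the general idea under some assumptions: the data set `scored`, the variables `id`, `group` and `pscore`, and the handful of scores are all made up so that the code runs on its own. The later passes (four decimals, three and so on) would repeat the same steps on whoever is still unmatched.

```sas
/* Made-up scores for illustration only: group 1 = ER, group 0 = urgent care */
data scored ;
  input id group pscore ;
  datalines ;
1 1 0.412347
2 1 0.250001
3 1 0.800000
4 0 0.412349
5 0 0.250003
6 0 0.412352
7 0 0.199999
;
run ;

/* Round the score to five decimal places and split the two groups */
data treated control ;
  set scored ;
  ps5 = round(pscore, 0.00001) ;
  if group = 1 then output treated ;
  else if group = 0 then output control ;
run ;

proc sort data = treated ; by ps5 ; run ;
proc sort data = control ; by ps5 ; run ;

/* Number subjects within each rounded score so the merge pairs them one-to-one */
data treated ;
  set treated ;
  by ps5 ;
  if first.ps5 then seq = 0 ;
  seq + 1 ;
run ;

data control ;
  set control ;
  by ps5 ;
  if first.ps5 then seq = 0 ;
  seq + 1 ;
run ;

/* Keep only the pairs where both a treated and a control subject exist */
data matched ;
  merge treated (in = t rename = (id = treated_id pscore = treated_ps))
        control (in = c rename = (id = control_id pscore = control_ps)) ;
  by ps5 seq ;
  if t and c ;
  drop group ;
run ;
```

In this toy example the subject with a score of .8 has nobody close and simply stays unmatched, which is the behavior described above.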

The advantages of this approach:

• It’s easy to understand
• For data sets with a large control group and lots of close matches it makes a lot of sense and you can get a very closely matched control group
• It’s not very computer-resource intensive. With a data set with 15,000 records, it took about a second to run on my iMac running boot camp with 4GB of RAM on Windows 7 with a 32 bit operating system. (Translation, it took one second on a not very fancy system.)

Really, if you can match 75% of your subjects down to the fifth decimal place on the propensity score and 90% of them within three decimal places,  I think any additional precision you get in matching is going to be trivial.

The disadvantages:

• You may not match everyone.
• If your data are very different on the variables included in your equation to create the propensity scores, a substantial proportion of your subjects may be unmatched.

This seems to occur a lot more often in medical studies than in the social science types of research I am usually working on. If you’re looking, for example, at people who elect to be in an experimental treatment study, they may be far sicker (and thus more willing to try an unproven drug or procedure) than people who go with a more orthodox treatment. People who are airlifted somewhere are probably much different from those who don’t get helicopter transport. So, how do you create groups when you are NOT going to find many close matches?

One way is to use calipers. In short, you match people who are within a given range, say within a propensity score of .10, or as used by another program, within .25 standard deviation of the logit. A second custom macro option, which was, as far as I can see, perfectly appropriate for the data in their study, performed this type of match using a combination of principal components analysis, cluster analysis, logistic regression, PROC SQL and hash tables.

PROC SQL made perfect sense because you are doing a many-to-many match. You want to match all of the people who fall within a given range with all of the other people. Hash tables, I usually consider unnecessary because their main advantage is increasing speed and whether my program runs in 1.5 seconds instead of 1.8 seconds is usually not a major concern of mine. Right now, this program without hash tables I am running as a test is on its second hour. So, hash tables make a lot of sense IF you are running on a computer with limited processing speed. (I have learned to test things on the worst piece of crap equipment I can find and thus spare clients that old canard of programmers and IT support, “It worked on my system.” If it works on all of our test systems it ought to run on anything short of a solar-powered graphing calculator.)
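For what it’s worth, the many-to-many caliper step can be sketched in a few lines of PROC SQL. This is only an illustration, not the macro from that article: the data sets `treated` and `control`, the variables `id` and `pscore`, and the scores themselves are made up, and the caliper of .10 is just the example value mentioned above.

```sas
/* Made-up scores for illustration only */
data treated ;
  input id pscore ;
  datalines ;
1 0.41
2 0.25
3 0.80
;
run ;

data control ;
  input id pscore ;
  datalines ;
4 0.45
5 0.30
6 0.38
7 0.20
;
run ;

/* Pair every treated subject with every control subject inside the caliper */
proc sql ;
  create table candidate_pairs as
  select t.id as treated_id,
         c.id as control_id,
         abs(t.pscore - c.pscore) as distance
  from treated as t, control as c
  where abs(t.pscore - c.pscore) <= 0.10
  order by treated_id, distance ;
quit ;
```

Note that this only lists the candidate pairs, closest first. Picking one control per treated subject, without using the same control twice, is a separate step, and that is where the more elaborate macros earn their complexity.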

What the folks in the second article were trying to do was find the optimal possible match. Given the problem with their data, that made perfect sense. If, as in the data I have at the moment, you can match your subjects very closely without all of the extra steps, it makes sense to do so.

There is a second thing that bothers me. I remember all the way back to an article on Analysis of Covariance I read in graduate school. The article was written even further back still, in 1969, by Janet Elashoff. Her main point was that analysis of covariance is “a delicate instrument”.

While I understand from a statistical and programming point of view what researchers are trying to do with some of the optimizing algorithms for matching, the premise is troubling. Some of these groups being compared are very different.

Elashoff was most concerned (and I agree) with situations where the two groups depart very far from the assumption of random assignment. We don’t have any new, magical mathematics today that makes that no longer a concern.

If you use a high-performance computer to crank through your population of 26,856 control subjects to find the 843 patients who are optimally like the 843 people in your study who elected to undertake a very risky treatment, at a minimum you have a non-random sample of the population of people who did not elect treatment. It’s also plausible that they are very different from the people who did receive treatment in other ways. The person who is willing to take any chance, no matter how remote, regardless of potential side effects, to find a cure and the one who is not interested in something that promises the possibility of six more months of life of questionable quality – those might be quite different people. All propensity score discussions mention the concern about “unobserved differences” and then scurry on to complex mathematics. I think they should linger a bit.

In some cases, though, it seems that there are NOT really many very good matches. You really can’t statistically control for the difference in intelligence between a genius and a moron using analysis of covariance – or anything else.

Curiously, it reminds me of the scene from Cool Runnings where the captain of the Jamaican bobsled team is whacking his teammates’ helmets. When one of his teammates asked him why he was doing that, the captain said it was because the Swiss team did it and they won a lot of races. To which the teammate responded,

“Yes, well they make them little knives, too, and I don’t see you doing that.”

There’s a lot of interest currently in applying these methods developed in medical research to social science. Some of those ideas are good, but some ought to be examined a lot more before we start widely applying them.

# Random basic SAS tips on ranuni, sampling with replacement & junk


1. Use the data sets in the sashelp directory for dummy data.

2. The RANUNI function is worth remembering

Today I needed to check something.  Specifically, I was using the ranuni function to generate random numbers for sampling with replacement. I wanted the data set to be sorted randomly, select the first match, then re-sort the data and re-sample. (Yes, obviously, I was doing propensity score matching.)

When you use a 0 for the seed, e.g.,

`randnum = ranuni(0) ;`

SAS actually uses the time of day. I was not sure whether it rounds the time to seconds, micro-seconds, nano-seconds or whatever. Even if I had known, I’m not sure that would have helped me, because what I wanted to know was whether the rounding factor is small enough that, when my program is running with a small data set, there is already a new seed by the next step. I *thought* so but you know, there is a reason we do testing. It’s because things don’t always come out the way you think.

I was going to create a dummy data set and then I realized, hey! There are all kinds of data sets already in the sashelp directory. You may never have looked at them, or noticed, but they  are there. So, I did this :

```
data yes ;
  set sashelp.air ;
  randnum = ranuni(0) ;
run ;

proc means data = yes ;
  var randnum ;
run ;
```

```
data yes ;
  set yes ;
  randnum = ranuni(0) ;
run ;

proc means data = yes ;
  var randnum ;
run ;
```

I did the PROC MEANS because I was too lazy to open the two data sets and look at the numbers. Yes, it worked. I expected it would.
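In case the random sort itself is not obvious, here is a minimal sketch of the “sort randomly, then take the first record” step, again using sashelp.air as stand-in data. The data set names `shuffled` and `first_pick` are made up for the example, and this is only the sorting idea, not my whole matching program.

```sas
/* Attach a random number to every record */
data shuffled ;
  set sashelp.air ;
  randnum = ranuni(0) ;
run ;

/* Sorting by the random number puts the records in random order */
proc sort data = shuffled ;
  by randnum ;
run ;

/* Take whichever record landed first */
data first_pick ;
  set shuffled (obs = 1) ;
run ;
```

Run it a second time and, because the seed comes from the clock, you should get a different order and usually a different first record.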

3. Avoid the unconditional ELSE statement.

This is a habit I got into years ago. I’m not one for giving other people rules for how to code because I think most of those rules given out as gospel are just one of many acceptable ways of doing things. This one, however, is worth remembering. Say you have two possible conditions, experimental and control. You would *think* it would make sense to type

```
if group = 1 then output experiment ;
else output control ;
```

It would make sense to you if you have never met any actual people. There are data entry errors, where someone types 11 or the letter “l” instead of a 1. People type “experiment” instead of 1. They leave that field blank because they don’t know if this person was in the experimental or control group. All of those people end up in the control group. The technical term statisticians use for this state of affairs is “bad”.

Instead, at a  minimum, do this.

```
if group = 1 then output experiment ;
else if group = 0 then output control ;
```

I worked with someone who had a good habit of always creating one more data set than he needed, which he named junk. Then, at the end of every IF statement that sent data to different groups was this

```
if group = 1 then output experiment ;
else if group = 0 then output control ;
else output junk ;
```

Not only did your 1s only go to the experimental group and your 0s only go to the control group but you also had a dataset that collected the junk where you could look at who these people were and try to figure out what their problem was. Personally, I don’t do that as a routine, but it is a good habit.
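Put together as a complete step, the junk habit might look like the sketch below. The data set `raw` and its five records are made up for illustration, with a stray 11 and a missing value thrown in to stand for the data entry errors described above.

```sas
/* Made-up records: id 3 has a typo (11) and id 4 was left blank */
data raw ;
  input id group ;
  datalines ;
1 1
2 0
3 11
4 .
5 1
;
run ;

/* All three data sets have to be named on the DATA statement */
data experiment control junk ;
  set raw ;
  if group = 1 then output experiment ;
  else if group = 0 then output control ;
  else output junk ;
run ;

/* Look at who ended up in junk and try to figure out their problem */
proc print data = junk ;
run ;
```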

I live just south of Malibu, a city filled with the type of houses God would have built if only he’d had enough money. Often, I’ll take the afternoon off and go hiking in the Santa Monica mountains, peeking into the backyards, tennis courts and golfing greens of the rich and richer. One thing they all have in common – no one is home. They are all at the office, I suppose, getting richer, while the multi-million dollar views are enjoyed by me. Oddly, perhaps, I’ve never had the slightest inclination to change the way I do business so I could change places with them.

One comment I received on twitter recently was that I obviously don’t understand business, because I questioned why no one makes a computer or phone in America.

I just bought both a new iPhone and a new iMac. If I’d had the option to buy one made in America, even if it was 20% more, I would have done it. I don’t like the idea that people are producing what I use in conditions we would not allow in this country.

Years ago, I was working at a company I had co- founded. We had hired a new person right out of his Ph.D. program and he kept coming up with ideas for research that would get shot down at our staff meetings. The marketing director or president would object,

“This is a for-profit company. How are we going to make money on that?”

Finally, he nodded knowingly and said,

“Okay, I get it, we do anything that makes money.”

One of my partners corrected him and said,

“No, everything we do makes money. We don’t do anything that makes money. There’s a difference. We went into on-line education and sold courses for caregivers of people with chronic illness to make money. It also provided better health care and helped people get training to get jobs they needed. We would not have done a marketing program for tobacco products because that would have made the community worse, not better.”

We’ve never out-sourced work. We’ve had a ton of offers. I’m on my third company. 2012 will be my 27th straight year of profits. I don’t have a house like Getty’s Villa. Of course, neither does he, because he’s dead.

At some point, you decide what type of company you want to be. The same partner and I had this discussion while drinking champagne in Union Station.  We’d just hit a million dollars in sales. He said,

“If I could go back to myself as a little boy on the reservation and say, forty years from now, you’ll have a doctorate, you’ll be in Washington, D.C., president of a million-dollar company, waiting to go have your picture taken with your senator on Capitol Hill – there’s no way I would have believed it.”

Neither my two partners nor I, given how far we had come in life, could see any sense in out-sourcing work to some other country rather than paying people on the reservation \$50,000 a year, just so we could live better. From where we sat in Union Station, it seemed that we were already living pretty damn good.

So, no, I don’t do anything that makes money, although everything I do makes money.

The next thing that came up on twitter, and often face to face, is people telling me that I don’t charge enough in consulting fees. I’m not even shy about it. I charge \$100 an hour. There, I said it. For volume discounts (which is the ONLY kind of discount I give) I charge \$75 an hour. If I have to wear a suit or leave my house, I charge extra. A LOT extra.

I hate those websites and services that are all coy about what they charge. “Call for quote.” Actually, don’t call for a quote. I don’t mean to be rude or ungrateful but I’m booked. I have a new research assistant coming in the morning. I have added someone to help with editing. I have another person helping with research in North Dakota. We have a couple of bids under review and if those get funded, we’ll add four more people – I already have them lined up. At the moment, my next available time is in May.

People tell me that proves I should charge more. Maybe. On the other hand, I have had many clients for ten years or more. Except for the two contracts I signed last week, all of our clients have been with us for at least three years. Yes, I know lots of consultants who charge lots more. I also know that I am charging the most some of my current clients can pay. It wouldn’t be right to charge clients from large organizations much more than I charge say, a small vocational rehabilitation project.

A lot of those people who brag to me about how much they make spend a good bit of their time preparing bids and attending events to schmooze with customers. Figuring in the time that those other people spend chasing business, their effective consulting rate is probably about half what they actually get paid. Running down the projects posted on the white board in my office, I can see that every single one came to us in the same way – they called me, we had a discussion, they signed a contract. When the economy went south a few years ago, I didn’t lose a single client.

I submitted a couple of proposals in the last six months because it was work I really, really wanted to do and I think it is really important. That is more important to me than money. (Of course, only someone whose bills are already paid can say that.)

Here’s another thing about charging a reasonable (my colleagues say less than reasonable) rate – you can choose to work with good people. I used to say that if you were a jerk there was up to a 100% surcharge for putting up with you. People used to think I was joking, but I wasn’t. Now, I don’t work with the jerks at any price. Life is too short.

One last thing from twitter – about hiring preferences, discrimination, women in tech, Latinos in tech, Native Americans in tech – whatever. Lot of tweets on that. The three people who I hired lately to pick up some slack are two Latinas and one Native American. Those are people I know. The other four people I have bid in the proposals under review are two Native Americans and two Latinos.

In Clay Shirky’s “Rant about Women” that I also was referred to from twitter, he gives examples of men who basically lie themselves into a position claiming expertise they don’t have. It is, unfortunately, not that uncommon in my experience. That is exactly what I DON’T want and why I tend to hire people who come with personal references.

People don’t set out to discriminate. They tend to hire people like them because that is who they know. If you grow up on a reservation, you know a lot of Native Americans. If you grow up in East Los Angeles, you know a lot of Latinos.

Not everyone who works with me is Latino, though. The rocket scientist retired and now he is picking up some of the programming and web development responsibilities. So, you can get hired here if you are white. You just need to sleep with me.

The last thing I see a lot on twitter is the lamenting of all the “time wasted” on twitter. Who says so? Who says it’s wasted? If in an average couple of days I can see tweets that have me thinking about our business philosophy, pricing and hiring, who is to say that’s a waste? Maybe not

# ASA’s New Look : It’s not your father’s statistical association


It’s been 15-20 years since I was last a member of the American Statistical Association. I read an article in their journals occasionally but not much of it is relevant to me. I work with clients who are designing surveys, analyzing messy data and evaluating programs. They do research but it is not in a laboratory or with accommodating undergraduates seeking ten points extra credit. It’s more often with people with substance abuse issues or learning difficulties who are not too excited about research, the researchers or anything related. My clients are not interested in some esoteric statistical technique that three people in the world will ever use or simulations using perfect data that demonstrate method X has a standard error of .337 while method Y has a standard error of .436.

As for me, I find some of that mildly entertaining, but, in my view, most academic journal articles are written by very intelligent people who are reinforced at every level of their education and career for writing that is deliberately obscure. I remember a professor joking (I hope he was joking!) when a student told him that he had to read his latest article twice before he understood it,

“Good! That means I can count it twice on my c.v.”

All of that being said, one might wonder how I found myself on a rainy Monday afternoon driving from Santa Monica to Irvine, on the 405, at rush hour. Generally, this is the sort of thing people don’t undertake voluntarily unless God is appearing at the Irvine amphitheater, or, at the very least, the Grateful Dead.

Actually, it was neither. It was a meeting of the southern California chapter of the American Statistical Association. Brenda Osuna, the senior statistical consultant at USC, had sent me an email asking if I was going. I wasn’t planning on it, but I took a look at the invitation and saw that the speaker was the new ASA president, Bob Rodriguez, who is also some mucky-muck statistical something at SAS. So, that sounded a little interesting. Then, I saw the topic was on big data and business analytics. That looked a lot more interesting, and not the typical ASA journal thing, so I was intrigued. Intrigued enough to pay \$189 to join ASA again and drive down there.

Was it worth it? Yes. I’d say so. Here is a summary of my tweets during the talk. Yes, this is a lazy way to do a post for the day, but since I took several hours away from business to attend, there is work that needs to be done. Here are my notes; comments from me are in italics.

Bob Rodriguez, ASA President speaking at University of California, Irvine on Business analytics and big data: What statisticians need to succeed

• Rodriguez advises students on job interviews to ask the interviewer about the types of data they have, the problems they have with data. He says whatever their industry, once people get started talking on that topic it is hard to get them to stop.
• Graduate programs confer 2,200 statistics degrees annually but there is a need for 160,000 more people with expertise in advanced analytics, data mining and statistical analysis. Where are we going to get those people? (I think Rodriguez is right. There is a shortage of statisticians. Business is through the roof for our company. He didn’t ask, but I do wonder, WHY are our own programs not recruiting and graduating more people? You can’t tell me we only have about 2k smart people out of every graduating class of college seniors.)
• Rodriguez gives the example of optimizing store markdowns as a statistical problem. ( I am really impressed because on first mention, I feel myself starting to snore, but … ) … he casts it as a problem with big data – tens of thousands of items, hundreds of millions of individual transactions, with mixed effects with stores as  a random effect – and I start getting fascinated. THIS is the type of teaching we need in statistics!
• What is big data? Big data is different from “not-so-big” data in volume, velocity and variety – AND it is increasing on all three dimensions.
• One way to attract new people to the field is to show them the number of really interesting, challenging problems in statistics/analytics/programming. (He couldn’t be more right here. This is why I need 30 hours in a day.)
• (In response to a question from the audience on how to handle big data …) Three possibilities. One is distributed processing across multiple computers/ nodes. A second is to co-locate the analyses with the data / in-database computing. The third is rewrite internal SAS procedures, for example, so it does not depend on the SAS supervisor. Rodriguez is leading a group that is hard at work on that.
• High demand areas for graduates are design & analysis of survey data, computation & large data computing and econometrics.
• Young statisticians need to be able to present the relevance of their work and to write concisely and clearly. (I agree, but I think the preferences of academic journals in general, including those from ASA, and of most professors are complicit in seeing that this has NOT happened up to this point. That’s a rant for another day.)
• 70% of enterprise data is unstructured: images, email, documents. This is why SAS is investing so much in text mining.
• (Comment from someone from the UCI Medical School, I think: we find very few statisticians who can write a good statistics section of a grant or an article.)

In summary  — it seems like the American Statistical Association is changing quite a bit from the last time I paid attention to it 20 years ago or so. I’m sure these changes have been in the works for a long time, but it is like any organization or company, once you decide it doesn’t meet your needs, you quit paying attention to it unless something comes up – in this case the invitation from Brenda.

I’m also going to the Joint Statistical Meetings for the first time ever. I even got roped into being a discussant on a panel. (Brenda, again). JSM is another thing I was aware of forever, but there were always plenty of other conferences to attend and present at. SAS Global Forum, the American Educational Research Association, SPSS Directions, the National Council on Family Relations – I could go through my c.v. and come up with a couple of dozen places I’ve presented over the years, and another dozen conferences I attended, and they always seemed more relevant to my business and the research I was doing than JSM.

Since I just re-joined ASA after *decades*, I would say that their efforts to broaden their appeal worked with me. So, yeah, if you dismissed them years ago as irrelevant, maybe you want to take a new look. From what I can see, it’s no longer your father’s American Statistical Association. Worth checking out.

# Making Statistics Interesting (No, really!) with SAS On-Demand


In a few (okay, a lot) of my previous posts I talked about how you could get set up with SAS On-Demand, problems you might have, programs to run. Now we get to the crux of the matter. Why?

Let’s assume that you are like most professors in America and your students are like most students. In other words, you are teaching statistics to people who:
a) don’t want to learn it
b) consider it boring
c) believe it doesn’t apply to them
d) think it’s too hard
e) all of the above

I am not being disrespectful to my students. If you have ever taken or taught a statistics class, can you honestly tell me that the above does not apply to one hell of a lot of people?

Here, on day one, is a graph from the very first data set analyzed. It’s a chart showing the distribution of medical visits from a sample of people 65 and older.

What questions can you ask based on this?

What observations can you make?

Oddly, I find many people jump right in talking about health care, socialized medicine or aging without ever once pausing to ask what exactly is this variable, which is the FIRST question they ought to ask.

The answer is that it is the total number of internal medicine visits over a nine year period. The first thing the students can see is that most people had very few, even over a nine year period. Then, they can look at the table of descriptive statistics and see that the mean is 24.5 visits even though the mode is clearly 0.

Both this table and the chart above are a result of the CHARACTERIZE DATA task, which is always the first thing I do with any data set if I am using SAS Enterprise Guide. It gives me an overview of the dataset – number of variables, range, minimum, maximum, mean, distributions – and identifies any glaring data entry errors.

This is the point where hopefully one of your students asks if the mode really is 0. It is definitely the minimum, as they can see from the table. It is the highest point, but is that actually zero or is it something like 0-5?

This is the part that almost never happens in any statistics or any other math class, the part where the instructor says, “Let’s find out.” (In fact, the modal number of visits over a nine-year period is zero, but that represents less than 4% of all people over age 65. You can find this out by going to TASKS > DESCRIBE > ONE-WAY FREQUENCIES and selecting the medicine visits variable as your analysis variable.)

Above we have another chart that shows a normal distribution superimposed on the distribution for medicine visits. You can see that the distribution is not normal at all. In fact, the left end of the curve is cut off your chart entirely. Thus, you can see that your distribution is positively skewed and non-normal. In a normal distribution, your measures of central tendency are all the same: Mode = Mean = Median. In this distribution the mode = 0, mean = 25 and median = 21. You get the above chart, measures of central tendency and a lot more by going to TASKS > DESCRIBE > DISTRIBUTION ANALYSIS, selecting the medicine visits variable as your analysis variable, then under DISTRIBUTION clicking the button next to NORMAL and under PLOTS clicking the button next to Histogram Plot.

A distribution analysis also gives you various statistics to test for normality which you may or may not cover in an introductory course. However, anyone can see in the graph above that part of the curve is missing so that is definitely not a normal distribution.
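If you would rather type code than point and click, the two tasks above correspond roughly to PROC FREQ and PROC UNIVARIATE. The sketch below is only an illustration: the data set `visits`, the variable `med_visits` and the fifteen values are made up so the code runs on its own, and they are NOT the data behind the charts in this post.

```sas
/* Made-up visit counts for illustration only */
data visits ;
  input med_visits @@ ;
  datalines ;
0 0 1 2 3 5 8 12 15 21 24 30 41 58 96
;
run ;

/* One-way frequencies: how many people had 0 visits, 1 visit and so on */
proc freq data = visits ;
  tables med_visits ;
run ;

/* Distribution analysis with a normal curve drawn over the histogram */
proc univariate data = visits ;
  var med_visits ;
  histogram med_visits / normal ;
run ;
```

PROC UNIVARIATE also prints the mean, median and mode, so students can compare the three measures of central tendency from the same output.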

Maybe everything is like that. Maybe most older people are healthy and there are a few who have a terminal illness, multiple health problems, whatever.

What do you think? Here it comes again … let’s find out.

Now here is the distribution of cholesterol. That certainly looks to be closer to a normal distribution. The mean = 245, median = 241 and mode = 230 .  When we take a look at a normal distribution super-imposed on the distribution for cholesterol, you can see that it is a heck of a lot closer to normal than the distribution for medical visits.
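Students who want to try the same comparison on their own can do it with a data set that ships with SAS. The sketch below assumes sashelp.heart, with its Cholesterol variable, is available in your installation. It is a different sample from the one in this post, so don’t expect the numbers above to come out.

```sas
/* Mean, median and mode side by side */
proc means data = sashelp.heart n mean median mode ;
  var cholesterol ;
run ;

/* Histogram with a normal curve; NOPRINT skips the tables and keeps the plot */
proc univariate data = sashelp.heart noprint ;
  histogram cholesterol / normal ;
run ;
```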

Hopefully by this point your students are starting to see that this statistics stuff is far from impossibly hard. They may have questions about the distributions of other variables, like blood pressure, BMI or other medically interesting stuff. These are questions they can be assigned to research themselves after you show them how to use SAS On-Demand.

You’ve looked at a couple of variables, but you haven’t yet discussed one of the real uses of descriptive statistics, that is, to describe your sample. Who are these people?

In this particular sample, we have two cohorts, one selected from people who were 65 or older in 1971 and a second cohort of people 65 or older in 1980. A fair number of them had been born in the 19th century. What were these people like? How did they differ from you and me?

Again, this is an opportunity for your students to ask and answer questions. Many of the answers are in those graphs from the CHARACTERIZE DATA task so you don’t need to run anything at all. Just go to the results and scroll down. For example …

The distribution of education was vastly different. The largest proportion of this population had not finished high school. Far less than 10% had finished college.

We’re working with patients who in many ways were born into a different world than the one we live in. In what other ways might they be different ?

Oops, too late. Class time is over.

If this is the part where students say,

“Wait a minute! That’s not fair. We didn’t find out the answer yet!”

Well, that’s a hell of a good start to a statistics course, don’t you think?

# Guest Post: SOPA by Julia De Mars


Last week, I was relaxing in bed reading a book when my daughter, Julia, came storming into the room. She demanded,

“Mom! Have you heard of SOPA?”

I informed her that yes, in fact, I had. She said she had read about it on Google and written a letter to her Congressman. Students in her eighth-grade class are required to select an article each week and write a summary on a current event. They take turns presenting their article and this week was Julia’s turn. I liked her presentation so much, I asked her to use it as a guest blog. This is only the second guest post I have ever had.

What?
SOPA= Stop Online Piracy Act
This bill means that if any copyrighted material is uploaded on the internet, the website it’s on could be taken down and the person who uploaded it could go to jail.

My current event is about SOPA, the response from the American public and the Chinese reaction to that response.

The article I used as reference is “The Chinese view of SOPA” from The New Yorker.

The author wrote, “Chinese’s reactions to American protests … ranged from sympathy to gentle Schadenfreude.” Some Chinese Web users are saying the American government is trying to copy them or think that they are ahead of the world, and we are racing to catch up.

Why?
● A lot of music industries and other entertainment thought they could make a lot more money if they could get SOPA passed.
● Websites are protesting because they don’t want their web sites taken down.
● The Chinese reacted the way they did because they have censorship there and we’ve always claimed to be so much better than them as far as not having censorship and now we’re starting down the same path.

How?
● The protests started with a few programmers creating a code to blackout their websites and passing it on to other web sites.

FUN FACT!!!(:
● Guess who else is opposed to SOPA?…
Justin Bieber!
Because if SOPA were passed when he was younger making youtube videos of himself singing Chris Brown songs they would be taken down and he would have been thrown in jail and never become famous.:(…(THANKYOU)

So, yeah – STOP SOPA!

Here is a PDF file of the presentation: SOPA

(My dad made the pdf file before I had a chance to fix a couple of things like misspelled words but they WERE fixed for the presentation for my class, just so you know.)

The other night, I was annoyed. This is a common state for me, so not too surprising.

I was annoyed because I could not log on to SAS On-demand. I would click the icon on my desktop and the log in screen would come up with my username filled in. About a minute after I typed in my password, the log in screen would come back up again.

I logged into my SAS Profile account using the same username, so it was obviously not a problem with my password. (As you will discover, “obviously” may have not been the word I was looking for here.)

I checked to see if the SAS server status was active (it was).

I tried logging in on a second computer to see if maybe the problem was with my computer. (No luck.)

I tried Google to see if anything came up on the search “Can’t log in to SAS On-Demand” but all it gave me was stuff I already knew was not the problem, e.g., make sure your internet connection is working (check), make sure you have SAS On-Demand installed on your computer (check), make sure you have the right password (check), make sure the server is up (check).

Since it was 2 a.m., I figured hey, maybe their server is down for maintenance. Went to bed, worked all day, forgot about it and then had the SAME problem the next night at which point I emailed SAS tech support.

It turned out that, as far as SAS On-Demand was concerned, my password had expired, which evidently happens after 180 days. However, when you try to log in, nothing tells you that your password is expired, and since the SAS profile password does NOT expire, you might reasonably assume, as I did, that a password which still logs you into your SAS profile is working and has not expired. If you cannot remember the last time you changed your password, just assume it has been over 180 days and change it.

Now what?

1. Just quit being a whiner and go change your password.  Go here to change it. https://www.sas.com/profile/user/login.htm

2. That sounded like a good idea. Unfortunately, I STILL could not log in to SAS On-Demand. I logged into the control center. It showed my account was active. It showed that my institution was Pepperdine. All good, right? However, the course I was teaching in the fall is over. I was using SAS On-Demand in part to prepare for my SAS Global Forum Hands-on Workshop and partly to test exercises and demonstrations for the following semester. You might think (reasonably) that a professor would have no need for SAS On-Demand for Academics when the semester is over. That presumes that you are only planning your course once the semester has already started, which is a pretty silly assumption. One thing I learned from getting a Ph.D. with three children under age five in the house is to never, never leave anything to the last minute because that is exactly when two of them will come down with chicken pox.

So, if you find yourself at 2 a.m. unable to log on, you have tried everything you can think of and are all out of Chardonnay, don’t despair. Change your password. (NOTE: This can take up to 30 minutes to take effect. Go have another glass of Chardonnay.) If that doesn’t work, register a course, even if you are not teaching it this semester. Although I had very good luck and was able to log on within the hour, you DO get a notice saying it may take up to a day for your course to be available. (Go have LOTS of Chardonnay.)

Thank you very much to Derek Hardison at SAS Tech Support for his assistance.

Thank you in general to the SAS Technical Support for not putting my phone number in the list of those they block. I know they have the technological know-how to figure out how to do it!

# Learning from the mistakes of others


Three times within a very short period I have seen technology offerings that I thought were an epic fail. In one case I was a beta tester, in one, a regular user and in the third a personal friend of the founder. All of them, I thought, had brilliant ideas that fit a real need in the market. In every case, I was disappointed because I really, really wanted them to succeed. All three products had the potential to help people, and ended up falling far short.

We’re working on a new product in our company right now so I have the utmost sympathy for the brains and laboring into the wee hours of the morning required to get anything to market. I’m also aware that the best-laid plans of the smartest people can go awry and one reason I’m not pointing fingers and calling people out is that I don’t want to eat those words a year or so from now when we have our roll out.

Instead, I’m trying to follow my grandmother’s advice. She told me everyone has to learn from mistakes and learning from the mistakes other people make is a lot less painful than having to make every mistake yourself. (It was something like that, but there was some Spanish mixed in there and I am pretty sure she was whacking me as she said it.)

In all three cases, I think the problem was they released a product without being ready for prime-time. In one case, the reason was the founder needed money and didn’t want to wait to start earning an income, any income. In another case, the company seemed to be trying to beat a competitor to market (I am also a beta tester for the competitor) and I’m not sure about the third.

Lesson learned 1: If your product isn’t ready, beating a competitor doesn’t do you any good. Yes, the company got a lot of press as the first, but most of the people who tried it will be unimpressed. There were a lot of features missing that would have been included if they had taken more time.  This brings me to …

Lesson learned 2: Know what’s a feature and what’s a frill. I hear the mantra of “fail fast” and “just ship”. That’s a “p” at the end of that “just ship”, not a “t”! It’s probably okay to release your product without world-class graphics (for most products, anyway). You can make your website prettier later. Then, there are the corners you can’t cut, like documentation that explains how the thing works. If links to pay with a credit card are broken then no one pays you and that is bad.

Lesson learned 3: Plan for the 20%. You know that old saying that 80% of your business comes from 20% of your customers? It’s probably also true that 20% of your customers will account for 80% of your features. With all three products, it was obvious that there were hooks in for what was to be done “later”. Who will notice we don’t have lesson 20 done the first week? Who will click on every link on our site? Who will be using this at 3 a.m.? How likely is it that 300,000 people will try to access our server at once the first week? How many people will try a mixed model with nested effects?

I’ll be the first to admit that I’m not the typical user of technology. I remember when I worked at USC, when we would get a new version of SAS or SPSS or JMP or Windows 7 or an iPad or Chrome, there were a couple of my co-workers and I who would immediately say, “Let’s try to break it!” We’d see what would happen if we opened a data set with 2,000,000 records or 67 browser windows at once or imported an Excel dataset written in Korean. Yeah, we’re a bunch of weirdos. I learned two related lessons from that, though. First, any feature you claim to have, someone will try, right up to the very limits you claim. Second, those people are probably far more likely to buy your product because they are the ones enthusiastic about technology.

Putting it all together – I am far more impressed with the company that has not come out with its product yet. It may suck in the end. It may be worse than the ones that rushed to market, who knows? Right now, though, they at least get points from me for not having made the mistake of rolling out a product that “could be amazing” but falls short. I’m also reinforced in the path we have been taking for the past several months: doing a lot of consulting work to pay the bills, which affords us the time we need for research and development. In the end, I’m twice as determined that anything we release will be tried and tested before it sees the light of day on any of our clients’ computers. That isn’t to say that we won’t be putting out some little bits and pieces here and there, like a free game on the app store, but they will be exactly what they are advertised to be – a little bit of something that does some small thing amusingly well. That’s my main take-away point – be everything you advertise yourself to be NOW, on the day you release, not six weeks or six months later.

It reminded me of a comment I once overheard a coach make to an athlete who wanted to be paid to train so he could win some event. Steve told him,

“We don’t pay for potential in this country. No one pays you for what you are going to be able to do. We care about what you are capable of doing right now.”

So, right now being 3 a.m., I’m going to get some sleep before my 10 a.m. conference call, do some work to pay the bills and then return to my going back and forth between working on this design and coding to make it happen.

Filed Under Open data, Software | 2 Comments

(If you are too young to remember the song, “Money for nothing and your chicks for free”, I guess the title of this post is not nearly as amusing to you as it was to me, as my lovely daughter so unhelpfully pointed out.)

Since I whined yesterday about Codecademy not providing much explanation of the code in the Quickstart course (where not much is defined as none at all), I thought I should not be so hypocritical.

I am often posting some code and saying I will explain it later. I noticed some of those from 2009 or 2010. Well, 2012 is definitely later. I think I’ll work backward though and start with what I promised to explain “later” this week.

As I’ve rambled on here a lot, open data is a great idea, but it takes some work. I decided to post some of what I’ve been doing here with explanation. In part, this was motivated by a talk I had today with some researchers I’ll be working with over the next few months. Someone said, quite correctly,

“You read the journal article and they say they did propensity score matching but they never tell you exactly how they did it, how they modified the code, which macro they used, because that’s not really the focus of the article. Unfortunately, when you go to replicate that model, you can’t because there isn’t enough detail.”

So, here in great detail from beginning to end is how I banged the data into shape for the analyses of the Kaiser-Permanente Study of the Oldest Old. These are data I downloaded from the Inter-university Consortium for Political and Social Research. (Free, open data).

The analysis was all done using SAS On-Demand for Academics.

`LIBNAME mydata "/courses/u_2.edu1/i_123/c_2469/saslib" ;`

The LIBNAME statement specifies the directory for my course on the SAS server. This is going to be unique to your course. If you have a SAS On-Demand account and you are the instructor, you know what this directory is. The “mydata” is a libref – that is, it is just used in the program to refer to the library a.k.a. directory. You can use any valid SAS name. Actually, I am not using anything in the class directory in this example, but I put it on the first line of every program as a habit so when I DO need the data in the directory available, I have it.

```
OPTIONS NOFMTERR ;
```

This prevents the program from stopping when it cannot find the specified format. SAS Enterprise Guide is generally pretty forgiving about format errors, and I used a .stc file which should have the formats included, but I usually include this option anyway. If there are missing formats, you’ll get notes in your SAS log but your program will still run.

```
FILENAME readit "/courses/u_2.edu1/i_123/c_2469/saslib/04219-0001-data.stc" ;
PROC CIMPORT INFILE = readit LIBRARY = work ISFILEUTF8=TRUE ;
```

This imports the file, complete with formats. I rambled on about this in an earlier post. In short, because this particular file was created on a different system, you can EITHER have the formats OR have it in your permanent (i.e. course) directory, but not both. Click on the previous link if you need detail on .stc files or CIMPORT. Otherwise, move on.

```
PROC FORMAT ;
  VALUE alc
    3 = "1-2"
    2 = "3-5"
    1 = "6+" ;
```

This is creating a format for one of the variables in the data set. As you can see, the variable was coded 1 = 6 or more drinks a day, 2 = 3-5 drinks a day and 3 = 1-2 drinks a day. I want to change that format so the actual values print out, instead of “1” for the people who had six or more drinks.

```
DATA work.alcohol ;
  SET work.DA4219P1 ;
```

The previous CIMPORT step created that dataset DA4219P1. I am reading it into a new data set, named alcohol, that I’m going to change and create variables for my final dataset to analyze. Everything I am doing in this program COULD be done with tasks in SAS Enterprise Guide, but I found it more efficient to do this way. Both data sets are in the temporary WORK library.

`ATTRIB b LENGTH = $10. c1dthyr LENGTH = 8.0 c2dthyr LENGTH = 8.0;`

In the ATTRIB statement, I am defining new variables and specifying the length and type. You don’t have to do this in a SAS program but if you don’t, SAS will assign length and type when it first encounters the variables, and it may not do it exactly the way you want.

```
IF alcopyr = 0 OR alcohol = 0 THEN amntalc = 0 ;
FORMAT amntalc alc. ;
```

The variable amntalc was missing for a large number of people, but most of those people had previously answered “no” to the questions asking if they drank alcohol in the previous year or ever in their life. If they said, “no” to either of those, I set the amntalc, which is how much they drink per day, to zero. This dramatically cut the amount of missing data. Then I applied the format.

```
death = INPUT(COMPRESS(dthdate),MMDDYY10. ) ;
b = "01/" || COMPRESS(brthmo) || "/" || COMPRESS(brthyr) ;
bd = COMPRESS(b) ;
/* b is day/month/year, so read it with DDMMYY10. (MMDDYY10. would treat the 01 as January) */
brthdate = INPUT(b,DDMMYY10.) ;
lifespan = (death - brthdate) / 365.25 ;
lifespan = ROUND(lifespan,1) ;
```

All the above calculates two variables I actually care about and a couple of others I’m going to drop. Death is the date of death. Lifespan is how old the person lived to be. First, I read in the date of death, which had been stored as a text field. I stripped out the blanks (that’s what the COMPRESS function does) and converted the text to a SAS date (that’s what the INPUT function does); the MMDDYY10. informat tells SAS the layout in which to read the data. Now I have the date of death.

The file doesn’t have actual birth days, just birth month and year. So, b = 01 plus the month and year – assigning everyone the birthdate of the first day of the month when they were born. The statement with birth date reads that as a SAS date. Now I am going to subtract the birthday from death date and divide it by 365.25 to give me how many years the person lived. Finally, I am going to ROUND the result to the nearest year.

Yes, I could have combined a lot of these statements into one. For example, I could easily combine the last two statements calculating lifespan. I did it like this because I use SAS On-Demand to teach and if my students peek at the program, which many do, it is easier for them to understand broken down like this.
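If you want to sanity-check the lifespan arithmetic outside SAS, here is a minimal sketch in JavaScript. It is not part of the SAS program: the function name and the sample dates are made up, and it assumes a death date stored as mm/dd/yyyy text plus a numeric birth month and year, as described above.

```javascript
// Hypothetical helper that mirrors the SAS steps; not part of the original program.
var MS_PER_DAY = 24 * 60 * 60 * 1000;

function lifespanYears(deathMmDdYyyy, birthMonth, birthYear) {
    // Strip blanks and parse "mm/dd/yyyy", like COMPRESS followed by INPUT.
    var parts = deathMmDdYyyy.replace(/\s/g, "").split("/").map(Number);
    var death = Date.UTC(parts[2], parts[0] - 1, parts[1]);
    // Everyone is assigned the first day of their birth month.
    var birth = Date.UTC(birthYear, birthMonth - 1, 1);
    // Days lived, divided by 365.25, rounded to the nearest year.
    var days = (death - birth) / MS_PER_DAY;
    return Math.round(days / 365.25);
}

// Made-up example: born June 1890, died March 15, 1975.
console.log(lifespanYears("03/15/1975", 6, 1890)); // 85
```

Using UTC dates keeps daylight saving time from creeping into the day count.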

```
if alcopyr = 1 then alcopyr = . ;
if smcigars = 1 then smcigars = . ;
if educ in (3,4,5) then education = 4 ;
else education = educ ;
```

The above recodes some variables. For a couple, “1” meant the data were not available, so I changed that to missing. For education, I combined three categories so my data were the more typical categories of “less than high school”, “high school” etc.

```
IF cohort = 1 then do ;
  IF death = . then c1dthyr = 12 ;
  ELSE c1dthyr = YEAR(death) - 1971 ;
  yrslived = c1dthyr ;
  dthy1 = YEAR(death) ;
end ;
else IF cohort = 2 then do ;
  IF death = . then c2dthyr = 12 ;
  ELSE c2dthyr = YEAR(death) - 1980 ;
  yrslived = c2dthyr ;
  dthy2 = YEAR(death) ;
end ;
```

There were two cohorts, one for which data collection began in 1971 and one in 1980. I wanted the option to analyze the cohorts as two different variables (and possibly split the data set later), so I created two variables named c1dthyr and c2dthyr. I also wanted one variable because I wanted to be able to compare survival rates by cohort, so I created a variable named yrslived. I was working with a student who was interested in deaths in a specific range of years, so I created variables dthy1 and dthy2 for her. The YEAR function returns the year part from a SAS date. A DO group (DO ... END) for each cohort took care of all of that.
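For readers who don’t speak SAS, the cohort logic can be restated as a small JavaScript sketch. The names below are illustrative only and nothing here comes from the original program apart from the start years and the value 12 that the SAS code assigns when there is no death date.

```javascript
// Hypothetical restatement of the cohort logic; names are illustrative only.
var COHORT_START = { 1: 1971, 2: 1980 };
var NO_DEATH_VALUE = 12; // what the SAS code assigns when death is missing

function yearsLived(cohort, deathYear) {
    // A null deathYear stands in for a missing SAS date.
    if (deathYear === null) {
        return NO_DEATH_VALUE;
    }
    // Otherwise, count years from the start of that cohort's data collection.
    return deathYear - COHORT_START[cohort];
}

console.log(yearsLived(1, 1975)); // 4
console.log(yearsLived(2, 1985)); // 5
console.log(yearsLived(2, null)); // 12
```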

```
DROP b bd
     F_A_X01 -- I_AUTPSY
     MR1_CS1 -- A_F_DT1
     D_A_DT2 -- H_DISDT4
     MRCSDT1 -- MR3_DT4
     IREVMVA -- NTIS5
     LABsum1 -- E_MRDX
     Vis1y1 -- vis9y9
     E_CR1 -- PRCS43
     B_INTR -- MRCSDDT23 ;
```

The last statement drops a bunch of variables I don’t need. The `somevar -- othervar` notation with the two dashes in between includes all of the variables in order in the dataset from the first variable mentioned to the second. There were literally hundreds of variables I wanted to drop, so here they are all dropped. Now, I have the file I want and I am happy and ready to start running stuff for my first day of class.

Filed Under Dr. De Mars General Life Ramblings, Software | 9 Comments

Seriously, someone has to say something about Codecademy’s courses. I feel like the little kid in the story The Emperor’s New Clothes. You know, the one where supposedly the clothes are made from cloth only really intelligent people can see, so no one wants to point out the emperor is actually naked.

With a week or so into the new year, 100,000, – no 500,000 – no a GAZILLION people are learning to code, and it includes the Mayor of New York, the president, the pope! Okay, well, maybe that is a bit of an exaggeration. Out of these 300,000 or 500,000 or 800,000 users, depending on what article you read, 99% of the tweets I read on twitter said things like,

“This is great! I’m learning to code!”

or

“I just unlocked badge #10. This is so wonderful!”

I have to agree with Ms. Watters’ review in (Not) learning to code that the people giving it rave reviews either aren’t past lesson two or already know how to code. Let me give two examples:

This first one is from the JavaScript Quick Start Guide, which is for people who already know some other language. I saw this after I had already completed the beginning programming course and I was happy to find it because I was interested in something that moved a little quicker. This little block below is all of the explanation on Do/While

7. Do/While
This is similar to most other languages:

```
do {
    // block
} while(condition);
```

Implement the increment one the last time.

Then there is this exercise, which, as you can see, I did complete correctly.

```
// This function should increment the start value by 1
// the number of times specified.
function increment(start, timesToIncrement) {
    // Add the appropriate code here, this time using a
    // do/while loop. This time, you must also write in the loop
    // body.
    var i = 0 ;
    do {
        start++ ;
        i++ ;
    } while( i < timesToIncrement);
    return start;
}
```

I actually was able to do it correctly because I had done it in Javascript before and I am taking the Codecademy courses because I would like to learn Javascript better, review material and do more with it. As I said in another blog, I’m also doing a lot of other stuff to learn more. My point is, if I didn’t already know how to do this I would be lost. There may be better ways to do this, although the little Codecademy program said it was correct. That kind of bugs me, because I would like to know if there is a better way to do this. I understand the reason for the start++ and i++ both being there because start may not always be equal to zero. This is a function, after all, and start is a parameter passed to it.
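On the “is there a better way” question, here is one possible answer, offered as a sketch and not as what Codecademy was looking for. Since the function only adds 1 a given number of times, plain addition gives the same result whenever the count is 1 or more. It also exposes something the lesson never mentions: a do/while loop always runs its body at least once, so asking for zero increments still adds one. The function names below are mine.

```javascript
// The exercise solution from above, unchanged apart from the name.
function incrementDoWhile(start, timesToIncrement) {
    var i = 0;
    do {
        start++;
        i++;
    } while (i < timesToIncrement);
    return start;
}

// Hypothetical alternative: no loop at all, just addition.
function incrementDirect(start, timesToIncrement) {
    return start + timesToIncrement;
}

console.log(incrementDoWhile(5, 3)); // 8
console.log(incrementDirect(5, 3));  // 8

// The two disagree at zero, because do/while tests the condition
// only after the body has already run once.
console.log(incrementDoWhile(5, 0)); // 6
console.log(incrementDirect(5, 0));  // 5
```

Whether that zero case matters depends on what the caller expects, which is exactly the kind of thing an explanation in the lesson could have covered.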

Then there is this in lesson 2 of the Functions in JavaScript course

…..To check that, run the code as is. You will see that the returned value is NaN, which stands for not a number. That happens to be the value returned when we try to multiply a string three times with itself, in the `return x * x * x;` statement.

Obviously, the only thing you are expected to do in this lesson is click RUN. You then see that the returned value is NaN. Yet, this tells me it is incorrect (duh, I know, that was the point of this lesson). Yet, when I change it to a number, it tells me I am incorrect – probably because what the instructions say to do is click RUN, not change it to a number. So, I can never successfully complete this lesson and hence can never complete the course. Since I didn’t really care beyond annoyance, I just went on to the next one.
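If you want to see the lesson’s point without the grader getting in the way, a few lines in any JavaScript console will do it. This is my own minimal sketch: the function name `cube` is an assumption, since all I have quoted from the lesson is the `return x * x * x;` statement.

```javascript
// Hypothetical stand-in for the lesson's function.
function cube(x) {
    return x * x * x;
}

console.log(cube("hello")); // NaN, because "hello" cannot be converted to a number
console.log(cube(3));       // 27
console.log(cube("3"));     // 27, because a numeric string is converted before multiplying

// NaN is never equal to itself, so test for it with Number.isNaN.
console.log(Number.isNaN(cube("hello"))); // true
```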

Not surprisingly, with 3 billion users, there is no way to give feedback on this lesson other than the thumbs up or thumbs down.

I want to say that Codecademy IS free and it IS helpful for me to spend 20 minutes on basic stuff because I don’t use JavaScript right now unless I force myself, while I have a project coming up where I am going to need it a lot. I would agree with Watters that you can do all of the lessons and still not be able to code anything. Applications like FizzBuzz – which I have not had a chance to do yet – are a really good addition since she wrote that article.

Codecademy IS free and it doesn’t suck. If you are brand new to programming the beginner’s course is very good as a basic introduction. However, it is very far from perfect and is not going to be replacing the MIT Computer Science department any time soon. In fact, I would say that it’s not going to replace any really good books on JavaScript, instructional videos on YouTube, the O’Reilly School of Technology or other quality resources out there that can provide a whole lot more detailed explanation.

Again, it’s free. So far, as I see it, it is good for getting people off to a beginning on basic concepts and it is okay for reviewing topics you already know. Will it develop a huge cadre of web programmers? Definitely not on its own. As others have speculated, I doubt more than 10-15% of those who sign up will complete much more than the beginner course. What it MIGHT do is encourage some people enough that they start from here and pursue some more serious study on their own, and that is a good thing. Or maybe some amazing courses will come out in the future and I will be totally wrong.

But all of this cheerleading about “Hurray, I’m learning to code! This is wonderful!”

You’re trying too hard. Cut it out. The emperor isn’t naked but he isn’t really all THAT well-dressed, either.
