Aug 30

Given that I taught my first statistics course in something like 1987, it’s surprising how much I am looking forward to teaching this fall. I gave up being a full-time professor in 1997. Both my academic career and my consulting business required my undivided attention, and I had to choose one. I’ve never regretted that choice, but at the same time, I have always missed teaching. For a long time, I would teach a course or two a year just to keep my hand in, but the last few years have been too busy. Also, many schools schedule adjunct positions fairly close to the start of the semester and I tend to be booked months ahead, so it doesn’t usually work out. Early this year, I was delighted when Pepperdine University asked me if I would be available to teach statistics in their doctoral program this fall. I love statistics, and doctoral students are a lot of fun to teach.

I wanted them to at least have the option of using SAS, especially now that there is a free SAS On-Demand for Academics. It had been a couple of years since I used SAS On-Demand, too. I tried it out almost as soon as it was available, and my first impression was not too favorable. It seemed too slow to be useful in classroom demonstrations, although I thought students could still use it for their own work; they’d just do what I did when I was testing it and keep a book to read while waiting for tasks to execute.

Well, I’ve been trying it out all day today and has it ever been a pleasant surprise!

There were a few things I had to remember, like the specific LIBNAME statement to use (you’ll see it when you log into your account), but overall everything went without a hitch. I uploaded a data set with over 7,000 records using FileZilla. I ran the CHARACTERIZE DATA task, a factor analysis, a chi-square, several graphs, and changed the output to HTML. In fact, I spent so much time on it that it ruined my plan for spending the afternoon in the Santa Monica mountains. I kept thinking that I would go in fifteen more minutes.
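Since that LIBNAME is specific to each instructor’s account, the directory path below is only a placeholder, but the general pattern looks something like this:

/* The course directory path is a placeholder -- use the one shown when
   you log into your SAS On-Demand account */
LIBNAME mydata "/courses/your_course_directory" ACCESS = READONLY ;

/* Data sets uploaded by FTP (e.g., with FileZilla) are then referenced
   with the two-level name, e.g., mydata.survey */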

It actually had fewer problems than the SAS Enterprise Guide running on my desktop. I only ran into two problems all day, one minor one and one very irritating one. The minor one was that often when I changed the properties of a variable, say, changed the format to percent8.2, or changed it to some user-defined format, I would get an error message saying SAS Enterprise Guide had an error and recommending I save my work. I just ignored it and went along with my life. I often run into this same error message with SAS EG on my desktop, where it crashes, so this was just a minor annoyance.

The huge pain in the ass was trying to get the last chart the exact way I wanted it. For no particular reason, I decided to graph the distribution of the number of books in the home for students who did not have a home computer. Everything went along well except that the X axis was labeled 1.4, 1.95 and other silly numbers. The original format was FAR too long to fit the values on the X axis, so I created a new format. The next problem was getting the values to fall in the unformatted order instead of having 101-200 followed by 11-26 books. It was very, very annoying and I never did figure out how to get it to work using the GRAPH option in the task menu. In the end, as you can see, I just re-did the values in the PROC FORMAT. It was annoying. The charts were interesting, though.
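For what it’s worth, the fix amounted to something like the PROC FORMAT below. The codes and category labels here are placeholders rather than the actual survey coding, but the idea is to recode the variable so the unformatted values sort in the same order the labels should appear on the axis:

/* A sketch of the workaround: recode so the unformatted values sort in the
   order the labels should appear. Codes and labels are placeholders. */
PROC FORMAT ;
    VALUE bookfmt
        1 = "0-10"
        2 = "11-25"
        3 = "26-100"
        4 = "101-200"
        5 = "200+" ;
RUN ;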

Distribution for students with no computers shows the mode at 0-10 books

 

You can see that for students with no home computer the mode was 0-10 books and it went down rapidly from there. It is a noticeably skewed distribution.  Also, note that only about 6% of the students in total said they did not have a computer at home.

Students with a home computer have a normal distribution with the mode occurring at 26-100 books

You can see that the distribution for children who do have a computer at home is very different, with more of a normal distribution centered around a mode of 26-100 books.

Except for the chart annoyance, I had very little difficulty using SAS On-Demand. I’m sure the students will find it frustrating just because it will be brand-new for them and any statistical package would be frustrating.  Overall, though, I would say it was far better than just usable. It was interesting enough to suck me in for hours and hours. I hope it will do the same for my students.

On top of all of that, I just signed up for a technology and learning conference for faculty on the Pepperdine campus, in Malibu, so I hope to get even more good ideas on using technology to teach from that.

And now, since it is far, far too late to make it to the mountains, I’m going to go walk along the beach for an hour until it is time to go pick up the world’s most spoiled 13-year-old from volleyball practice, because the earth would cease turning on its axis if she were to have to walk a mile home from school in 70-degree weather.

Yeah, it’s a tough life here in Santa Monica.

Aug 25

Today, taking a break from writing the grant proposal that has no end, I found myself thinking about easy ways to explain and understand standard error.

To understand standard error, you have to have some statistic that you’re discussing the standard error of. As a random example, let’s just take the mean.

T-TEST PROCEDURE FOR TESTING FOR DIFFERENCE BETWEEN MEANS
Let’s just say we have a sports organization that is interested in knowing whether there is a significant difference between the numbers of competitors in the male and female divisions each year. Since Title IX passed and we are supposed to be all equal in sports, there should be no significant difference, right?

To test for the difference in means between two groups, we compute a t-test. A t-test done with SAS will give four tables of results. The first one is shown below.
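The code that produces those four tables is just a few lines. The data set and variable names below are placeholders for whatever your file is actually called:

/* A minimal sketch -- data set and variable names are placeholders */
PROC TTEST DATA = tournaments ;
    CLASS sex ;              /* grouping variable: Female / Male */
    VAR n_competitors ;      /* number of competitors each year  */
RUN ;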

Table 1
First PROC TTEST Table
sex          N     Mean       Std Dev    Std Err    Minimum    Maximum
Female       22    97.0909    15.8201     3.3729    66.0000    119.0
Male         22    222.8      45.2253     9.6421    146.0      324.0
Diff (1-2)         -125.7     33.8792    10.2150

There were 22 records for males and 22 for females. The mean number of competitors each year was 97 for females, with a standard deviation of 15.8 and a range from 66 to 119. For males, the mean number of competitors was almost 223 per year, with a standard deviation of 45 and a range from 146 to 324.

What exactly is a standard deviation? Roughly speaking, a standard deviation is the typical amount by which observations differ from the mean. So, if you pulled out a year at random, you wouldn’t expect it to necessarily have exactly 222.8 male competitors. In fact, it would be pretty tough on that last guy who was the .8! On the other hand, you would be surprised if that year there were only 150 competitors, or if there were 320. On the average, a year will be within 45 competitors of the 223 male athletes, and about 95% of the years will be within two standard deviations, or roughly from 133 to 313. That is, 223 – (2 x 45) to 223 + (2 x 45).

WHAT IS THE STANDARD ERROR AND WHAT DETERMINES IT?
But what is the standard error? The standard error is the average amount by which we can expect our sample mean to differ from the population mean. If we take a different sample of years, say, 1988-2009, 1991-2012, all odd-numbered years for the last 30 years, and so on, each time we’ll get a different mean. It won’t be exactly 97.09 for women and 222.8 for men. Each time, there will be some error in estimating the true population value. Sometimes we’ll have a higher value than the real mean. Sometimes we’ll underestimate it.

On the average, our error in estimate will be 9.64 for men, 3.37 for women.

Why is the standard error for women lower? Because the standard deviation is lower.

The standard error of the mean is the standard deviation divided by the square root of N, where N is your sample size. The square root of 22 is 4.69. If you divide 15.82 by 4.69, you get 3.37.  Why the N matters seems somewhat obvious. If you had sampled several hundred thousand tournaments, assuming you did an unbiased sample, you would expect to get a mean pretty close to the true population mean. If you sampled two tournaments, you wouldn’t be surprised if your mean was pretty far off. We all know this. We walk around with a fairly intuitive understanding of error. If a teacher gives a final exam with only one or two questions, students complain, and rightly so. With such a small sample of items, it’s likely that there is a large amount of error in the teacher’s estimate of the true mean number of items the student could answer correctly. If we hear a survey found that children of mothers who ate tofu during pregnancy scored .5 points higher on a standard mathematics achievement test, and then find out that this was based on a sample of only ten people, we are skeptical about the results.
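
If you want to check that arithmetic yourself, a quick DATA step will do it (the numbers are just the ones from Table 1):

/* Standard error of the mean = standard deviation / square root of N */
DATA _null_ ;
    std_dev = 15.8201 ;
    n = 22 ;
    std_err = std_dev / SQRT(n) ;
    PUT std_err= 8.4 ;       /* prints about 3.3729 */
RUN ;
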
What about the standard deviation? Why does that matter? The smaller the variation in the population, the smaller the error there is going to be in our estimate of the mean. Let’s go back to our sample of mothers eating tofu during pregnancy. Let’s say that we found that children of those mothers had .5 more heads. So, the average child is born with one head, but half of these ten mothers had babies with two heads, bringing their mean number of heads to 1.5. I’ll bet if that was a true study, it would be enough for you never to eat tofu again. There is very, very little variation in the number of heads per baby, so even with a very small N, you’d expect a small standard error in estimating the mean.
The second table produced by the TTEST procedure is shown in Table 2 below. Here we have an application of our standard error. We see that the mean for females is 97, with 95% confidence limits (CL) from 90.08 to 104.1. That interval runs roughly from the mean minus two times the standard error to the mean plus two times the standard error. That is, 97.09 – (2 x 3.37) to 97.09 + (2 x 3.37).

Why does that sound familiar? Perhaps because it is exactly what we discussed 10 seconds ago about a normal distribution? Yes, errors follow a normal distribution. Errors in estimation should be equally likely to occur above the mean or below the mean. We would not expect very large errors to occur very often. In fact, 95% of the time, our sample mean should be within two standard errors of the mean.
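
One small detail: the exact limits SAS prints in Table 2 below use the t critical value for 21 degrees of freedom, which is a little bigger than 2, rather than the rule-of-thumb 2. A quick check, again using the values from Table 1:

/* 95% confidence limits for the female mean, using the t critical value */
DATA _null_ ;
    mean = 97.0909 ;
    std_err = 3.3729 ;
    t_crit = TINV(0.975, 21) ;          /* about 2.08 for 21 degrees of freedom */
    lower = mean - t_crit * std_err ;   /* about 90.08 */
    upper = mean + t_crit * std_err ;   /* about 104.1 */
    PUT lower= upper= ;
RUN ;
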
Table 2
Second PROC TTEST Table

sex          Method           Mean       95% CL Mean          Std Dev    95% CL Std Dev
Female                        97.0909     90.0766    104.1    15.8201    12.1713    22.6080
Male                          222.8       202.7      242.8    45.2253    34.7941    64.6299
Diff (1-2)   Pooled           -125.7     -146.3     -105.1    33.8792    27.9348    43.0609
Diff (1-2)   Satterthwaite    -125.7     -146.7     -104.7

The next two lines both say Diff (1-2) and both show the difference between the two means is -125.7. That is, if you subtract the mean for the number of male competitors from the mean number of female competitors, you get negative 125.7. So, there is a difference of 125.7 between the two means. Is that statistically significant? How often would a difference this large occur by chance? To answer this question we look at the next table. It gives us two answers. The first method is used when the variances are equal. If the variances are unequal, we would use the statistics shown on the second line. In this instance, both give us the same conclusion: the probability of finding a difference between means this large, if the population values were equal, is less than 1 in 10,000. That is the value you see under Pr > |t|, the probability of a greater absolute value of t. If you were writing this up in a report, you would say,

“There were, on the average, 126 fewer female competitors each year than males. This difference was statistically significant (t = -12.30, p < .0001).”

Table 3
Third PROC TTEST Table
Method           Variances    DF        t Value    Pr > |t|
Pooled           Equal        42        -12.30     <.0001
Satterthwaite    Unequal      26.064    -12.30     <.0001

In this case the t-values and probability are the same, but what if they are not? How do we know which of those two methods to use?

This is where our fourth, and final, table from the TTEST procedure comes into use. This is the test for equality of variances. The test statistic in this case is the F value. We see the probability of a greater F is < .0001. This means that we would get an F value this large less than 1 time in 10,000 if the variances were really equal in the population. Since the normal cut-off for statistical significance is p < .05, and .0001 is a LOT less than .05, we would say that there is a statistically significant difference between the variances. That is, they are unequal. So we would use the second line (Satterthwaite) in Table 3 above to make our decision about whether or not the difference in means is statistically significant.

Table 4
Fourth PROC TTEST Table
Equality of Variances
Method      Num DF    Den DF    F Value    Pr > F
Folded F    21        21        8.17       <.0001
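
In case you are wondering where the 8.17 came from: the folded F is just the larger sample variance divided by the smaller one, which you can check from the standard deviations in Table 1.

/* Folded F = larger sample variance / smaller sample variance */
DATA _null_ ;
    f = (45.2253 / 15.8201) ** 2 ;   /* about 8.17 */
    PUT f= 8.2 ;
RUN ;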

Aug 22

The census now allows more than one race to be checked. For many years, friends of mine in inter-racial couples, when they registered their children for school, would check the “Other” box for race rather than pick black or white.

Although an individual’s census form responses are confidential, you certainly are free to tell anyone what you put. In response to an inquiry, the White House spokesperson said that President Obama had checked only “African-American or Black”, even though his mother is white.

Now that you can select both black and white as race, I wondered how many people did. Unlike normal people who wonder about these things, I decided to download all 3,030,728 records from the 2009 American Community Survey to find out. Once I downloaded the survey and read it into SAS, I produced the chart below. I was quite surprised to see how few people checked both black and white. As you can see, it was less than 1%.

Distribution of race shows 0.7% selected both black and white
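
The one step not shown in the code below is reading the downloaded file into SAS in the first place. A hedged sketch of that part, with a placeholder file name standing in for whichever PUMS file you download:

/* File name is a placeholder; PROC IMPORT will pull in every column */
PROC IMPORT DATAFILE = "ss09pus.csv"
    OUT = lib.pums9
    DBMS = CSV
    REPLACE ;
    GUESSINGROWS = 1000 ;
RUN ;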

The SAS code to create this chart is shown below. You might think this is a ridiculous amount of work to create one chart and that you could do it way more easily in Excel. You’d be correct, except for two things. One, in earlier versions of Excel there was no way you could read in 3,000,000+ records. Even if you can do it now, I’ll bet it’s painfully slow. Two, most of these options or steps only need to be done once, and I was doing multiple charts. The AXIS and PATTERN statements only need to be specified once.

If you DO want to create your chart in Excel, you could just do the first part, the PROC FREQ, and then export your output from the frequency procedure to a four-record file and do the rest in Excel. There is no need to get religiously attached to doing everything with one program or package.
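That route would look something like this; the output file name is a placeholder:

/* Export the four-record PROC FREQ output to a CSV and finish the chart in Excel */
PROC EXPORT DATA = lib.blkwhitmix
    OUTFILE = "blkwhitmix.csv"
    DBMS = CSV
    REPLACE ;
RUN ;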


PROC FREQ DATA = lib.pums9 NOPRINT ;
    /* Weighted crosstab of the black and white race indicator variables */
    TABLES racblk * racwht / OUT = lib.blkwhitmix ;
    WEIGHT pwgtp ;
RUN ;

DATA byrace ;
    SET lib.blkwhitmix ;
    /* Collapse the two indicators into a single four-category Race variable */
    IF racblk = 1 AND racwht = 0 THEN Race = "Black" ;
    ELSE IF racblk = 0 AND racwht = 1 THEN Race = "White" ;
    ELSE IF racblk = 0 AND racwht = 0 THEN Race = "Other" ;
    ELSE IF racblk = 1 AND racwht = 1 THEN Race = "Mixed" ;
    /* PROC FREQ writes PERCENT on a 0-100 scale; rescale it for the PERCENT. format */
    percent = percent / 100 ;
RUN ;

/* The AXIS and PATTERN statements only need to be specified once */
AXIS1 LABEL = (ANGLE = 90 "Percent") ORDER = (0 TO 1 BY .1) ;
AXIS2 ORDER = ("White" "Black" "Mixed" "Other") ;
PATTERN1 COLOR = BLACK ;
PATTERN2 COLOR = GRAY ;
PATTERN3 COLOR = BROWN ;
PATTERN4 COLOR = WHITE ;

PROC GCHART DATA = byrace ;
    VBAR Race / RAXIS = axis1 MAXIS = axis2
        SUMVAR = percent
        TYPE = SUM
        OUTSIDE = SUM
        PATTERNID = MIDPOINT ;
    LABEL Race = "Race" ;
    FORMAT percent PERCENT8.1 ;
RUN ;
QUIT ;

Aug 18

On the lake shore last night

It wasn’t until a banker sneeringly said to me,

“So, you’re a lifestyle business.”

That I realized a lifestyle business was not, as I had mistakenly thought, people who call themselves “life coaches”, Mary Kay consultants, personal trainers, financial consultants, or other businesses that tell you what lifestyle to have.

No, it turns out that a lifestyle business is what we referred to back when I started in the consulting business in 1985 as a “small business”. You provide a service. People pay you. You pay bills. You pay taxes. At the end, if there is money left in the bank, you are still in business. If there is enough money, you get trips to the Bahamas, private schools for the kids, a 401(k) for retirement and dinner at Chinois on Main. You work hard, deliver the goods for your customers and live in the style to which you want to be accustomed. Like either Romy or Michelle (I forgot which) in Romy and Michelle’s High School Reunion, I always thought we were happy and had a good life.

According to the banker, and the self-appointed “serial entrepreneurs”, I was wrong. The point of business is to have an “exit strategy”. My original exit strategy was to have one of my kids take over the business. All of them have worked for me at some point, entering data, editing grants, writing articles for newsletters. They started out at the entry level and by the time they got to senior research associate, they bailed, telling me,

“You work too hard.”

Also, they had their own interests. Fair enough. So, I am working, running a business, and I guess when I get bored of it I will wind things down and spend my remaining days on a beach on some Caribbean island, being visited regularly by my children and grandchildren.

That is WRONG, the banker tells me. My exit strategy is supposed to be to have all of the work done by people in other countries working for less than minimum wage, or, at least, far less than the prevailing wage in America, and then sell out for lots of money. My goal is supposed to be to become the next Sergey Brin, Bill Gates, Mark Zuckerberg or whoever made millions this week. According to the banker, I am a failure at business.

According to Luke Johnson of the Financial Times, though, maybe not so much. Dave Winer wrote a blog post a while back where he pointed out, very rightly, that if you’re making over a million dollars a year, a good bit of it is “windfall”. In an interesting article with the title “Zuckerberg wannabes squander careers”, Johnson points out that getting huge amounts of capital without much more than a business plan is akin to winning the lottery. He also notes that the businesses that don’t get the windfall go out of business.

I remember driving by eToys when they were going bankrupt and seeing movers taking everything out of the building. And they were one of the companies that actually sold stuff. I hear about many more companies that are supposedly disruptively innovative and amazingly great, and yet I never see a product or service come out of them, and I never hear of them again after their exit. Some companies, like eToys, Pets.com or Webvan, I think sincerely intended to develop a business and it just did not work out. I respect those people and I certainly understand them. Not every product line or service we ever tried made us money, and I have nothing bad to say about people who take risks and fail. That’s just the way it is. We may be in that situation in the next few years, where we either make a lot of money on the new product we’re developing or we make less than we had hoped. Having learned from some examples of people who ramped up too quickly, I think it unlikely we’ll go bankrupt or out of business, but it’s possible sales will be disappointing.

On the other hand, it seems like a lot of businesses are just another version of the Madoff scheme, where money changes hands and everyone is getting high returns – until they aren’t, because there is no real product, just an idea. I remember seeing a prospectus years ago for the Reusable Pie Crust company. Their marketing pointed out that since no one eats the pie crust anyway, it could be replaced by a reusable pie crust and no one would know. The location of the company headquarters was listed as “In Mr. Andrews’ hat”, the members of the Board of Directors all had the last name of Andrews, etc. It was obviously a joke. Perhaps it is my old crankiness, but some of the start-ups I have seen sold make me think of the Reusable Pie Crust Co.

Years ago, I worked for a large corporation that had been a successful start-up and sold out. I had a nice office, a nice salary and a comfortable 401k. I downloaded a lot of photos from Webshots to use on my computer screen, and even ordered a couple of posters for my office walls.

Last night, I had finished the business I had to do in North Dakota and spent the night at the Woodland Resort. On my way to the airport, I was driving along a country road and it occurred to me that I get to work in those places that other people use as their screensavers.

So, I would advise against being too quick to discount those “lifestyle businesses” and I think I’ll keep with my small bank that has the same idea. Last time I drove up to make a deposit, they didn’t make any derogatory comment about my business. Instead, two bankers came out and cleaned off my windshield and gave me a hot dog (I’m not kidding and I still haven’t the faintest idea what that was all about).


Aug 1

I’ve forgotten more about statistical software than you’ll ever know!

I don’t know why people ever say this in a bragging tone, because I consider that to be my problem. I’ve forgotten it. Today, I needed to do a confirmatory factor analysis with someone using AMOS. They wanted it in AMOS, so that is what I did. All the parameter estimates came out correctly, and the model fit indices were good. The only problem is that I knew they would want the estimates on the diagram, and I could not remember how to do it. Yes, they could have copied and pasted it into Word or Graphic Converter or any of a number of other packages and then typed the numbers into text boxes, but people pay me so they can do less work, not more. Besides, I KNEW there was an option or something. Here it is:

When you are at this point, with your lovely model well-specified and all of the output showing up nicely when you select TEXT OUTPUT under VIEW:

AMOS diagram without estimates

Click the second button at the top of the middle pane. That will put your path estimates on your diagram.


Three other things to remember

  1. For students, there is a FREE version of AMOS 5. It works great. I was amazed that something free would work so well. When I saw there was a free version, I thought for sure it would be a scam, but no: I downloaded it from AMOS Development Corporation, ran it on a Windows 7 machine, and it worked fine. I believe it only runs on Windows.
  2. The AMOS manual by James Arbuckle is incredibly well-written. I’m using AMOS with SPSS 18, but I noticed he wrote the version 19 manual also and it’s pretty much the same. Don’t buy the manual; you can download the PDF free from lots of university sites.
  3. The Indiana University Stat/Math center has a really straightforward discussion of CFA using AMOS.

 

Now you know the secret of why I write this blog. Because every time I forget something I remember, “I wrote a post about that once”. And the nice thing about the Internet is even if I’m in Fort Totten, ND or Lac du Flambeau, WI or Tunis, Tunisia, I can look it up and find it.

Unlike my coffee cup. Now, where did I put that?
