Jul
7
What Small Businesses Need to Create Jobs
Filed Under Dr. De Mars General Life Ramblings, The Julia Group | Leave a Comment
I’ve been in business for over twenty years. All of that time, I have run a small business, by choice. During those twenty years, I have had a sick husband, been widowed, had four children – so I had some reasons that becoming the next Oracle was not my priority. However, I have made a profit every year, some years more than others, and have increased and decreased my number of employees as necessary.
The more articles I read on small business in general and women-owned businesses in particular, the more I wonder how many of those organizations talking about helping small business owners create jobs include people who have actually run a small business.
There seems to be a great concern about the disparity in access to venture capital. Now, that may be a concern for some small businesses but most of the people I know own consulting companies, hair salons, restaurants, retail stores or manufacture products like t-shirts. They are not attractive to VCs because they are not going to have exponential growth.
Many of these small business owners, like me and my friends, are going to be in business for ten, twenty years or more, and pay corporate taxes, payroll taxes and everything else our accountant says we have to pony up every few months.
What about jobs?
I think everyone trying to create jobs through small business should read the insightful article Andrew Grove, Intel co-founder, wrote on this subject. Those high-flying tech firms create a lot of jobs – overseas! One problem with the VC-find-the-next-Apple approach, of course, is that those jobs may help investors but they don’t help the U.S. unemployment rate. Many, many of the high tech, high ROI jobs end up in China and India. (Seriously, read Grove’s article. It’s great.)
Twenty years ago, my business partners and I decided against outsourcing because we did not want to employ fewer Americans and pay someone in another country a sub-minimum wage so we could be richer. I know that sounds un-American, but part of our motivation in founding a business, which still derives much of its revenue from work on reservations, was to make life better for people. Obviously, we are privately owned, so we can make those choices.
The other thing I don’t need that every agency and company seems to want to sell me is a business plan. I have a business plan. Like most companies, the gist of it is to have revenues exceed expenses. Okay, it is a little more than that, BUT – after 20 years most of the business owners I know are not kept from hiring by lack of a plan. In fact, their plan is to add workers to meet the demand. It is certainly NOT to take out loans (guaranteed or not) so we can expand and hire more workers.
If anyone really seriously wanted to help small business create jobs they would make it easier for them to get business.
I had to laugh. Several times, representatives from the same “small business services” company have called me telling me,
“We’ll help you get YOUR money from the federal government. After all, it’s YOUR money.”
and then went on to promise me we could get on the GSA schedule and agencies would be falling over themselves to just pull our name up and order a million dollars of consulting services from us. I told their representative that’s not the way it works and he assured me it was and they had done that for lots of companies. I told him to email me the name of one. I’m still waiting.
I am not sure where the stimulus money went. I see some signs that the roads are being upgraded with Recovery Act funds, so that is a good thing. I don’t actually know anyone who got any of that $200 million that went to the NIH in grants, although I know a lot of people who applied, but that all went to universities anyway.
I may actually bite the bullet and complete the section 8(a) application this year, although it grates on me to do it. The time I spend on that will take away from billable hours so it will actually COST me money. I’m still debating on it.
Don’t get the idea that we’re sitting around here whining. We have work enough to keep the people we have employed and I am now looking for new contracts. We’ve already turned down a few over the last year, which may sound inconsistent, but it’s not.
We submitted one proposal in May, a second in June. We have too much current work to take time away to do a proposal this month. I’ll submit one or two in August and September, depending on how tired I feel.
Taking a six-month or shorter contract that takes up all of your time and keeps you from bidding on multi-year contracts is not good business. Just bidding on everything that comes down the pike isn’t too bright, either. We look for a match between our capabilities and what the client needs, for areas we can really do excellent work. That way, they are happy and come back to us again and again. After a quarter-century in business, we DO kind of know what we are doing.
I hear a lot about tax breaks for small business. Well, we pay a hell of a lot of taxes and that would be nice. Even though we will probably be exempt from the requirement to provide health care, I have always offered that as an option to employees and our costs may go up a little. Taxes and health care costs are not what keep me from adding employees.
It seems like the people aiming to help small businesses are sincere. However, it’s like the old cliche that when the only tool you have is a hammer, every problem is a nail. Because most of these organizations have people who know how to write business plans, fill out loan applications, apply for certifications of some status or another and lobby on Capitol Hill, that’s what they see as the way to help small businesses.
Most people who have been in business for decades don’t need some consultant to help them develop a business plan before they can add jobs. If their business has been around a long time, they already have a line of credit. I’m not sure what they need is tax cuts or worse health care coverage for their employees.
What they need is work.
I’m surprised I have to explain this to you.
Nov
19
What Matters in Statistics
Filed Under Technology, The Julia Group, statistics | 1 Comment
Maybe I have been wrong.
It wouldn’t be the first time. In fact, most of the really great things in my life have come about when I realized I was on the wrong track and took a sharp right turn. (Uncharacteristically skipping the opportunity here to make a snarky comment about my first boyfriend, job or marriage.)
Four incidents made me reconsider what matters in statistics.
The first two were related. I attended a really fascinating lecture on statistics where the speakers discussed pretty much what morons 99.99% of the world were, how assumptions are violated, variables are not normally distributed and as a result our standard errors are usually wrong, and not just a little bit, by a lot, and we should all be learning new statistical methods. Honestly, I found myself nodding my head in agreement at most of what they said, and I have to admit that I have been guilty of some of these same errors myself.
Shortly thereafter, I was at the SPSS Directions conference and I asked the equally fascinating Bill from SPSS about the research he was doing to predict shipments of contraband, violent incidents in Richmond and cyberattacks. He said much of what they were using was Decision Tree.
It dawned on me that both of these brilliant men were right. On the one hand, making the wrong assumptions can inflate your standard errors by a lot. On a practical level, with 100 or even 300 subjects this can make a substantive difference. However, if you have 12,000 or 12,000,000 records then yes, your standard error may actually be three times what you incorrectly have assumed, but if that means it is .00027 instead of .00009 – it isn’t going to make much difference. This harks back to that article in Wired a few years ago, on The End of Theory. The gist of it was that with the amount of data we have, who needs science? At the time, I spotted some immediate flaws in the argument, and similar ones since. If you don’t look even at some basic regression assumptions you may miss, for example, that your prediction is great overall but for your highest dollar customers, most violent offenders, or whatever your dependent variable, your error rate is much higher, and these are the exact people you most want to predict (i.e., you have violated the assumption of homoscedasticity). I have to say, too, that the analogy with language translation was less than compelling. The article said that Google can translate from one language to the next without knowing the language. Well, not so well. I just typed a phrase in Spanish into Babelfish that I had said to my little daughter lying in the next bed in the hotel. It means, “You are such a beautiful little girl”. The translation came back, “That pretty girl.”
… and yet … it was close, it didn’t come back with “Hey, buddy, you want to buy a goat?”
It wouldn’t take too much effort, really, to take the 1,000 or 10,000 most common phrases in each language and enter those into the software and have these checked and THEN go on to the word by word translation.
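Here is a minimal sketch of that phrase-list-first idea, in Python. Everything in it is an assumption for illustration: the `PHRASEBOOK` and `WORDS` entries are made up, and no real translation service is claimed to work this way.

```python
# Hypothetical phrase list that a human has already checked.
PHRASEBOOK = {
    "buenos días": "good morning",
    "¿cómo estás?": "how are you?",
}

# Hypothetical word-by-word dictionary, used only as a fallback.
WORDS = {"buenos": "good", "días": "days", "gato": "cat", "negro": "black"}


def translate(text):
    """Look the whole phrase up first, then fall back to word-by-word."""
    key = text.strip().lower()
    if key in PHRASEBOOK:
        return PHRASEBOOK[key]
    # Words missing from the dictionary are passed through unchanged.
    return " ".join(WORDS.get(word, word) for word in key.split())


print(translate("Buenos días"))  # found in the checked phrase list
print(translate("gato negro"))   # falls back to word-by-word
```

The fallback keeps the Spanish word order, which is exactly the kind of close-but-not-quite result you get when nobody has checked the phrase.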
The same with statistics. You can build in the diagnostics, as SAS has done with ODS GRAPHICS, for example.
I think some knowledge of statistics is crucial, but I am not so convinced that minor departures from normality, small correlations among variables or some heteroscedasticity will damn us all to statistical hell when our datasets are approaching hundreds of gigabytes on a fairly regular basis. Yes, it will not be precisely correct. Yes, there IS danger to not understanding some of the basic underpinnings of statistics. The Chronicle of Higher Education forum has anonymous (more or less) users but some day I do hope to meet in person the person whose signature says,
“Being able to find SPSS in the start menu does not qualify you to run a multinomial logistic regression.”
Not only funny, but true. On the other hand, I don’t think you need to be able to calculate structural equations in your head to be qualified to design, conduct and interpret a statistical study.
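The sample-size point from earlier is easy to check with a few lines of arithmetic. This is only a sketch: it uses the standard error of a mean (standard deviation divided by the square root of n), and the standard deviation of 1.0 and the sample sizes are assumptions picked for illustration, not figures from either talk.

```python
import math


def standard_error(sd, n):
    """Standard error of a mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)


SD = 1.0         # assumed standard deviation, for illustration only
INFLATION = 3.0  # the "three times what you assumed" scenario

for n in (100, 300, 12_000, 12_000_000):
    assumed = standard_error(SD, n)
    actual = INFLATION * assumed
    print(f"n={n:>10,}  assumed SE={assumed:.5f}  actual SE={actual:.5f}")
```

With a hundred subjects the tripled standard error is large enough to change a conclusion. With twelve million records it is still a rounding error.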
The third incident included the same Bill-the-I-think-he-was-a-vice-president from SPSS. (A person better than me at sucking up would have found out his last name. I did make a faint effort on Google, yet again proving less than infallible as Bill appears to be a popular name for SPSS vice-presidents and I am writing this at 1:30 a.m. on the east coast as I have not yet adjusted to the time zone so am somewhat imprecise.) I ran into him after his presentation and asked him if he had published his results, as the improvements in prediction they had achieved were really quite remarkable. He said,
“Not in the sense that someone like you from a university means by publishing.”
He went on to say that he had presented at conferences like this one and discussed the results with customers but not published in an academic journal.
Finally, there was the SAS Tech Report, in which editor Waynette Tubbs, writing about finding a job and networking, asked, “Are you published?” but she meant having a blog, doing papers at places like WUSS (Western Users of SAS Software) and SAS Global Forum. This is very far from the definition of publishing that I was taught (nearly brainwashed), as almost all doctoral students at research institutions are, that peer-reviewed, academic journals were the gold standard, be-all, end-all and 90% of the measure of your worth as a human being.
So, is it just barely possible that having a very, very good understanding of statistics, albeit not to the point of tossing off pages of proofs off the top of your head, and writing about it in a comprehensible fashion is what really matters? To a regular person, this probably doesn’t sound too crazy, but to someone who has spent most of a lifetime in academia it borders on heresy.
I think that knock on the door is a group of inquisitors come to burn me at the stake.
On the other hand, it may just be room service.
Aug
4
Discovering if your data blow with help from SAS Enterprise Guide
Filed Under Grantwriting, Software, The Julia Group, statistics | Leave a Comment
“Is there anything you can do to help? I’d kill you but there is a law against it. You’d better leave before I figure out a way around that.”
This comment was made by a co-worker of mine who had saved all of the data for his thesis for a masters in computer science on his hard drive. Someone who needed assistance had stopped in his office, popped in a floppy disk and accidentally formatted the hard drive instead of the floppy. I tell this story just to point out that people screwing with your data is a phenomenon that dates back to at least floppy disks, which, if you ask my children, is equivalent to prehistoric.
Why You Need to Look at Your Data Seven Different Ways before you do ANY Statistical Analyses.
- The data were entered by clerks making minimum wage who hate that they are doing a job that, were it not for animal cruelty laws, would be done by a half-trained monkey.
- The data were entered by really bright undergraduates at a prestigious university who smoked something really good before coming in to work. (Are they still called joints? Email me if you know the answer.)
- After you taught all day, graded papers, read the RFP for your next grant, you entered all the data yourself – and finished both data entry and your third martini at 2 a.m.
So… you have your data entered into SAS Enterprise Guide. Congratulations. The very first thing you should do is go to the Tasks menu, select Describe and then select the List Data option. If you have a small dataset, you may want to list the whole thing. Otherwise, click on the Options tab. In the window to the right, in the drop-down box under Rows to list, select ‘Every nth row’, giving a value for n, say 10. This is what statisticians refer to as a systematic random sample and what other people, who do not invite us to their parties, refer to as every tenth row.
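If you are not in Enterprise Guide, the same every-tenth-row listing is a one-liner in most languages. Here is a rough Python sketch. The `data` list is a made-up stand-in for a real dataset and `every_nth_row` is my own helper name, not anything SAS provides.

```python
def every_nth_row(rows, n, start=0):
    """Return every nth row, beginning at index `start`."""
    return rows[start::n]


# Hypothetical dataset: 95 rows of (id, score).
data = [(i, i * 2) for i in range(1, 96)]

sample = every_nth_row(data, 10)
for row in sample:
    print(row)
```

Strictly speaking, a systematic random sample picks the starting row at random. Passing a random `start` between 0 and n - 1 would do that.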
The output is very plain vanilla, as you can see. You could make it prettier, but why? I do like the fact that SAS EG lets me output it as an html file so it can be uploaded easily and read by anyone. Because I do a lot of work as a telecommuter, this makes my life easier. Unlike most of what makes my life easier – the housekeeper, the detail car wash guy, Safari Books – the html output feature doesn’t charge me. So, props to it.
Go here for more step-by-step on how to use List Data. This is my personal university web page.
(I can link from here to there but not vice versa because some people are concerned about a rumor that this blog is written without supervision by the university attorneys, or in fact, by a responsible adult of any profession. This rumor is true.)
Next awesome innovation, go to Tasks again, then Describe then Characterize Data. This task reminds me of the first grader who wrote in his book report, “This book taught me more about penguins than I wanted to know.”
The characterize data task may tell you more about your data than you want to know if you just go with the default options, so I wouldn’t. I’d recommend unchecking the boxes next to Graphs and also the one next to SAS Datasets that produce the datasets containing Univariate statistics and frequencies. You may need those datasets or charts for every variable, but usually you don’t. It just slows down your job and produces a bunch of output you aren’t going to look at, especially if you have dozens or hundreds of variables. You may want to look at graphs for some selected variables later.
By default, the characterize data task will give you frequency distributions for categorical variables with 30 or fewer categories, and, for other categorical variables, the frequencies of the 30 most common categories. You can change the default from 30, if you would like. It will also produce descriptive statistics for all numeric variables, as well as the number of missing values. Again, you can make your output prettier than my output shown here, with titles, footnotes and probably embedded images of bells and whistles, but since the purpose of this page is to check for out of range values, outliers, etc., why bother, unless you are really, really bored.
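For anyone who wants to see the logic of that reality check spelled out, here is a small Python sketch of the same idea: frequencies for categorical columns, descriptive statistics for numeric ones, and a count of missing values for both. It is an approximation of what the task reports, not the SAS implementation, and the `records` are invented (note the suspicious age of 210).

```python
from collections import Counter
import statistics


def characterize(records, max_categories=30):
    """Summarize each column: frequencies for text, descriptives for numbers."""
    summary = {}
    for col in records[0]:
        values = [r[col] for r in records]
        present = [v for v in values if v is not None]
        info = {"n_missing": len(values) - len(present)}
        if present and all(isinstance(v, (int, float)) for v in present):
            info.update(mean=statistics.mean(present),
                        min=min(present), max=max(present))
        else:
            # Keep only the most common categories, as the task does by default.
            info["top"] = Counter(present).most_common(max_categories)
        summary[col] = info
    return summary


# Hypothetical data; None marks a missing value.
records = [
    {"age": 34, "group": "A"},
    {"age": None, "group": "B"},
    {"age": 210, "group": "A"},
    {"age": 41, "group": None},
]

for column, info in characterize(records).items():
    print(column, info)
```

The maximum for age is what gives away the typo, which is the whole reason for running this before any analysis.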
In some cases, it may be of interest to see if you have a normal distribution because you really do expect one. In this case, go to Tasks again, select Describe then Distribution Analysis. If you select Normal under Distributions, you can enter the hypothesized mean (except it isn’t hypothesized at all since you just saw it in the previous task) and standard deviation, too, if you so desire. Click on Plots and then select Histogram to see a histogram of your data with a normal curve super-imposed.
You can also use the Titles options to enter titles and footnotes, since one should never miss the opportunity to suck up to the funding agency. If you want to change the output for some reason, say, you have a purple fixation, you can go to the Tools menu and select Options. Click Results, then HTML. You can select a different style for your output, then re-run the distribution analysis.
There. Purple. Are you happy now?
Actually, I am happy. The data look pretty good. Everything is pretty much in range, as shown in the descriptive statistics, not much missing data, the values and distributions on all the categorical variables are reasonable, the dependent variable is approximately normally distributed, so we are good to go on parametric models.
Reality check passed. For the data, that is. As far as those smoking, martini-drinking minimum-wage earning data entry people, the jury is still out.
Jul
31
There is no such thing as conservative math!
Filed Under The Julia Group, statistics | Leave a Comment
Statisticians should not listen to talk radio or to anything on the Fox network. Those people who say that you can prove anything with statistics are mistaken. You can prove anything with statistics to people who don’t understand statistics. I think some of those same people you can prove anything to with a box of doughnuts.
So, last night we were listening to someone talking about the health care debate and for the umpteenth time it was brought up that,
“If Obama’s bill passes, we’ll have socialized medicine, JUST LIKE IN CANADA!”
What’s with all the hating on Canada all of a sudden? Have these people ever been to Canada? They make it sound like it is full of health-care preventing unlicensed physicians looking to give lethal injections to your grandmother. Seriously, the scariest thing you can think of is that we’ll become Canadian? Now, if they had some evidence that government health care (which, by the way, what do you think Medicare is) causes cold weather, I would vote against it.
However, the latest argument that really set me off was when someone pointed out on one of these shows,
“You know, Canada actually has a life expectancy a couple of years longer than the U.S.”
[I verified this fact on that source of all knowledge, wikipedia.]
Did the overweight, overpaid talk show host at this point stop and say,
“Oh, really?”
No, he did not. He said,
“Well, you can’t compare the two. The U.S. has a population ten times that of Canada. So, there are ten times more homicides, ten times more accidents.”
At this point I began yelling at the TV,
“YOU MORON! That would mean the absolute NUMBER of deaths due to those causes is higher. Don’t you know the difference between a frequency and a percentage?”
My husband, a.k.a. the calmest person in the world, as evidenced by his 12 years of marriage to me, said,
“Well, that must be conservative math. It must be different than liberal math.”
“There is no such thing! There is only math. That is why I specialized in statistics, for God’s sakes! So I wouldn’t have to have stupid arguments with people on whether the square root of the variance was the standard deviation in Tibetan culture or if it was a male chauvinist standard deviation. It just IS. If you have ten times as many people and ten times as many homicides, your rate of homicide is the SAME. Besides, we have way more homicides than Canada!”
This argument was settled by my husband making me a martini, with olives, and agreeing with me.
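For the record, here is the frequency-versus-rate point in a few lines of Python. The populations and counts are invented round numbers chosen to make the arithmetic obvious. They are not actual U.S. or Canadian figures.

```python
def rate_per_100k(events, population):
    """Convert a raw count into a rate per 100,000 people."""
    return events * 100_000 / population


# Hypothetical figures: the second country has ten times the people
# and ten times the homicides of the first.
small_country = rate_per_100k(50, 1_000_000)
large_country = rate_per_100k(500, 10_000_000)

print(f"Small country: {small_country:.1f} per 100,000")
print(f"Large country: {large_country:.1f} per 100,000")
```

Ten times the people and ten times the events gives the same rate, so population size alone cannot explain a difference in life expectancy.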
By the way, what is it with all this Canada-bashing all of a sudden? It used to be that high on the list of “stuff white people like” was threatening to move to Canada. I noticed that all of these conservative math advocates are white. Where are they going to threaten to move to now?
Jun
12
When acceptance is really rejection: Death by Green Pants
Filed Under The Julia Group, statistics | 2 Comments
The model is non-significant, therefore my theory is supported.
Huh?
Just when you thought it was safe to get back into statistics… It took you two years of graduate school but now you have it down. P-value low = good, relationship detected, publication, tenure, Abercrombie & Fitch models at your feet.
P-value = high, no relationship, no publications, no money, dating the creepy guy next door.
Enter Hosmer to screw things up.
There are a whole bunch of reasons you might want to do a logistic regression (no, I’m serious). If you want to predict a categorical dependent variable like death, drop-out or watching Afghan Star. If you were going to do a propensity score match you would start with logistic regression. If you plain can’t think of anything else to do with your evenings.
The first thing would be to see if your dependent had a relationship with your grouping variable or you really are wasting your time. Okay, now that is settled, you have found that people seen in hospitals with Intensive Care Units are more likely to die than those seen at other hospitals.
You also want to see if the variables on which they differ have anything to do with the outcome. For example, I ran an analysis where I coded their favorite colors of pants – blue, brown, white, black or green pants (seriously, who buys green pants?). People who went into intensive care were more likely to own green pants. To test if this is significant, I run a logistic regression with death as the outcome variable and pants color as the predictor.
In SPSS you go to ANALYZE > REGRESSION > BINARY LOGISTIC
So, the Hosmer and Lemeshow Test is statistically significant with a chi-square of 349.06, df = 4 and p < .001. Is that exciting? Do I immediately publish an article on “The American Apparel Effect” and how poor fashion taste is dangerous to your health?
Not so fast. You see, Hosmer & Lemeshow tests the Goodness of Fit of the model predictions to the observed data. If you reject the hypothesis that your model fits the data, that is bad!
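If you want to see what the test is doing under the hood, here is a rough Python sketch. It assumes the usual textbook recipe: sort cases by predicted probability, cut them into groups, compare observed and expected events in each group, and refer the total to a chi-square with (groups - 2) degrees of freedom. The function names are mine, the toy data are made up, and SPSS may group cases somewhat differently.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square statistic, integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return math.exp(-x / 2) * total
    term, total = 1.0, 0.0
    for i in range((df - 1) // 2):
        if i > 0:
            term *= x / (2 * i + 1)
        total += term
    tail = math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total
    return math.erfc(math.sqrt(x / 2)) + tail


def hosmer_lemeshow(outcomes, probs, groups=10):
    """Return (chi-square, df). Assumes no group is empty and that the
    predicted probabilities are strictly between 0 and 1."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)
        observed = sum(y for _, y in chunk)
        stat += (observed - expected) ** 2 / (expected * (1 - expected / size))
    return stat, groups - 2


# Made-up, perfectly calibrated toy data: observed events match predictions.
probs = [0.25] * 4 + [0.5] * 4 + [0.75] * 4
outcomes = [1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0]
stat, df = hosmer_lemeshow(outcomes, probs, groups=3)
print(stat, df, chi2_sf(stat, df))

# The statistic reported above: a tiny p-value here means a poor fit.
print(chi2_sf(349.06, 4) < 0.001)
```

The thing to notice is the direction: a large chi-square means observed and predicted counts disagree, so a significant result is a complaint about the model.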
In my next logistic regression, I used age over 65 as a dichotomous variable. My second variable was the Dr. MechOth scale. Dr. MechOth (not her real name) was a friend of mine when I was a young Assistant Professor who occasionally hung out in bars. Dr. MechOth rated all men on a 1 to 3 scale, where 1 = “Yes”, 2 = “Maybe if I was drunk” & 3 = “I couldn’t get drunk enough”.
The results of the Hosmer & Lemeshow test shown below, with a chi-square = 4.52, df = 3, p > .20 show that the data fit the model somewhat, although it could be better.

Does this mean that in logistic regression high p-values are always a good thing? Nope, that would be too easy for you to remember. In fact, no sooner have we inverted our understanding of p-values but now it is time to do it again. When interpreting the COEFFICIENTS, a low p-value is a good thing. So, which of Dr. MechOth’s groups one is in, and being really, really old are related to probability of death.

Sadly, my original hypothesis about death by green pants is not supported and all I have discovered is that if you are really, really old and no one would go home from a bar with you if you are the last person on earth, you are more likely to keel over dead from natural causes or suicide, whichever comes first, than hot, young people.
I do not think I will be winning the Nobel Prize for Medicine any time soon. I wonder if that guy next door likes Cup-A-Noodle soup.
Jun
3
Controlling for Damn Near Everything: Propensity Score Matching
Filed Under The Julia Group, statistics | 1 Comment
Lately I have been on a roll looking at relatively less common statistical techniques, proportional hazards, survival analysis, etc.
In keeping with that, I have been taking a look at propensity score matching, fondly known as PSM by, – well, by no one actually.
The problem to be solved ….
Think about some of these comparisons:
- Hospitals with special burn, cardiac or neonatal units versus general hospitals
- Public schools versus parochial, private or charter schools
- People who watch TV > 40 hours weekly versus those surfing the Internet > 40 hours
In all of these cases, and probably a lot more you can think of, there are very likely differences in certain “outcome” variables, whether it be survival in the case of hospital patients, academic achievement of students or annual income of TV versus Internet users. However, all of these comparisons also begin with groups who are already different.
For example …
You have two groups, say people who are treated at a hospital with a specialized unit for terminally ill patients and patients from another hospital without any such specialized unit. Your outcome variable of interest is whether the patient lived or died.
The simplest way to test this is a chi-square. You compare the percentage of people who survived at St. George of Money Hospital versus Heart of Despair County Hospital. There is a problem with that, though. A simple comparison will almost always show WORSE outcomes for hospitals with special units for patients who are terminally ill, seriously burned, extremely premature births, etc. The reason is probably obvious – if you get sicker patients, they are less likely to live. If your interest is in knowing whether having a specialized unit increases your chances of survival, you would want to compare similar groups.
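As a sketch of that simple comparison, here is a Pearson chi-square for a 2 x 2 table in Python, without a continuity correction. The survival counts are invented for illustration and the function name is my own.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], with its p-value.

    With 1 degree of freedom the upper tail is erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: (survived, died) at each hospital.
st_money = (60, 40)  # hospital with the specialized unit
despair = (80, 20)   # hospital without one

chi2, p = chi_square_2x2(*st_money, *despair)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

In this made-up table the hospital with the special unit looks worse, which is the trap: the test says the groups differ, not why.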
It isn’t as simple as just controlling for severity of condition, though. There are other variables, for example, people who are better educated, who have private insurance and who live in urban areas all may be more likely to be patients at more “elite” hospitals. Some of those factors may be related to survival as well. What we’d really like is to compare a group of people from St. Money’s that is similar to patients from Despair.
In short, certain types of people have a greater propensity to be admitted to one type of place than the other.
Enter propensity score matching — to the sounds of trumpets and wearing a cape.
In fact, the first step is to do a logistic regression analysis and I will admit that it is not strictly necessary to wear a cape while doing so but it would probably be more comfortable than this business suit from Filene’s that I am wearing.
Using SPSS, go to the ANALYZE menu, select REGRESSION, then select BINARY LOGISTIC. Your dependent variable will be the hospital to which the patient was admitted. Covariates are the variables such as education, severity of illness and insurance that you want to control. For variables that are categorical, e.g., insurance, which could be private, public (a.k.a. Medi-Cal, if it hasn’t disappeared in the latest round of state budget cuts) and none, click on the CATEGORICAL button and move those over to the “Categorical covariate” window.
Here’s the really important part — click on SAVE and select PREDICTED PROBABILITIES – that is your propensity score.
This is what you are going to match on. Hence the name.
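For those without SPSS handy, here is a bare-bones Python sketch of that first step: fit a logistic regression and save the predicted probabilities. It assumes a single made-up covariate (a centered severity score) and uses plain gradient ascent rather than whatever estimation routine SPSS uses, so treat it as an illustration of the idea, not a replacement for real software.

```python
import math


def predict(weights, x):
    """Predicted probability of admission to the specialized hospital."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.1, steps=5000):
    """Fit by gradient ascent on the mean log-likelihood.

    weights[0] is the intercept. The learning rate and step count are
    arbitrary choices that happen to work for this small example.
    """
    n, k = len(rows), len(rows[0])
    weights = [0.0] * (k + 1)
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for x, y in zip(rows, labels):
            err = y - predict(weights, x)
            grad[0] += err
            for j in range(k):
                grad[j + 1] += err * x[j]
        weights = [w + lr * g / n for w, g in zip(weights, grad)]
    return weights


# Hypothetical patients: severity 1 to 8, centered at 4.5.
rows = [[s - 4.5] for s in range(1, 9)]
# 1 = admitted to the hospital with the specialized unit.
admitted = [0, 0, 0, 1, 0, 1, 1, 1]

weights = fit_logistic(rows, admitted)
propensity = [predict(weights, x) for x in rows]
for x, score in zip(rows, propensity):
    print(f"severity={x[0] + 4.5:.0f}  propensity={score:.3f}")
```

Each patient now carries a single number between 0 and 1, and that number is what gets matched across the two hospitals.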
This is step one. I would say it gets easier after this point – but it doesn’t.
Mar
6
The Sea Monkey Effect prevents robot uprising
Filed Under Dr. De Mars General Life Ramblings, Software, The Julia Group, statistics | 1 Comment
Since I have written about odds ratios and logs lately, I was going to write about the natural log of the odds ratio, however, random events have caused me to do otherwise.
I read an interesting blog by Adam Jackson lately, in which he is concerned that robots will take over the world. At the very least, he worries, all of the data we are submitting on ourselves via our friends on Facebook, our pictures on Flickr, our personal web pages on igoogle, travelocity account, grocery store frequent shopper cards and a thousand other sources from billions of people will be combined and inserted into a series of equations that will predict our every move.
I say it ain’t so. Billions and billions have been spent on data mining and we still cannot predict behavior with any degree of accuracy. Two obstacles are imperfect models and random error. The latter might be better described as, “The State of Being Human”. For example, I hate SQL. I even wrote a blog about how much I hate it. That doesn’t mean I never use SQL, or even that, if offered an unimaginably enormous amount of money, I would not become an SQL programmer. In fact, I was going to write on my blog about types of sums of squares a few days ago (since it was Square Root Day), but instead, I have decided to write about how robots are not taking over the world.
Let’s go with the imperfect models, first. I have a tank of sea monkeys on my desk that I bought for the purpose of collecting random data. Among the variables I have entered into an SPSS dataset are the following:
- DAYS: Day since purchase
- MONKEYS: Number of sea monkeys
- SPECKS: Number of specks indistinguishable from sea monkeys
- EVENT: A binary variable, event occurred or not
- NAME_EVENT: A string variable, (e.g., added food, added water)
- ACTIVITY: an ordinal ranking of sea monkey activity from none to high
- WTF: Number of co-workers who walk by, look on my desk and say “WTF? Sea Monkeys?”
I have tried various models to predict the dependent variable WTF, which is my co-workers’ behavior. I have tried all of the independent variables, linear functions, quadratic functions, t-tests, multiple regression. To date, my attempts could only be characterized as a success if success has a completely different meaning than the one with which I am familiar. In fact, in one spectacular non-success event, I actually achieved an adjusted R-squared of -.232, meaning in statistical terms that my equation was, in fact, WORSE than doing nothing in terms of predictability.

This can be interpreted as SPSS’s way of saying,
“Why don’t you just shut the hell up, because a better equation could be developed by, say, a sea monkey?!”
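If a negative R-squared sounds impossible, the usual adjustment formula shows how it happens. The Python sketch below uses the standard formula, 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. The example values are assumptions for illustration and are not the numbers from my sea monkey dataset.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical case: five predictors explain 5% of the variance in 20 rows.
print(f"{adjusted_r_squared(0.05, 20, 5):.3f}")

# Hypothetical case: two predictors explain 90% of the variance in 100 rows.
print(f"{adjusted_r_squared(0.90, 100, 2):.3f}")
```

A weak model with many predictors and few observations gets penalized below zero, while a strong model with plenty of data barely moves.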
Part of my problem may be that I just don’t have enough data. It could be that I need more data, and once I have it, like Adam Jackson’s megalomaniac robots, I could take over the world. Or, my failure to attain predictive nirvana may be due to having an imperfect model. It could be that I have the wrong variables. In that case, I could have variables on a billion people on a billion days and run it on a super-computer and still end up with no better prediction than nothing.

It is very likely that my dependent variable is not explained by the number of sea monkeys, the number of specks, the degree of sea monkey activity or anything at all external to my co-workers. Perhaps the variables I should have measured are these:
BOREDOM: Total rating of each worker
ATTEND : Number of people at work today on the third floor
PERSONBORE: The product of the first two variables for a boredom factor for the floor
WEATHER: This could be an ordinal scale from nasty, cold, rainy to perfect beach weather
(March 4 in Los Angeles – everyone else, eat your heart out!)
Even if I were to have the right variables, you cannot measure everything. Believe me, I have been trying most of my life. There is that random error factor. Co-worker A may have a boredom index of 4,000,000, the weather outside may be hailing bowling balls so she is not leaving the building, she may walk by my desk 762 times but she is NOT going to speak to me, not even if the sea monkeys get hit by radiation leaking out of the pipes, grow to the size of a dinosaur and swallow our new CIO whole because once, three months ago, she arrived at the elevator shortly after me and I did not hold the door for her and she had to wait for the next elevator thus getting to her meeting thirty seconds late and blowing her chance to make a great impression on the new CIO.
But, what if you had a billion billion data points… wouldn’t that work? No. If you had that many data points you would be uniquely identifying each person. That would work kind of like tests like the Myers-Briggs work, where you check off boxes that you’d rather go to a party than solve algebra problems, you’d rather exercise than paint pictures and you’d rather drink a martini than write a blog. You get results back that say you like parties, exercise and drinking better than math, computers and art. No new information is added.
Mathematically, if you use a billion different data points to predict a billion people’s outcomes, you get just what I got with a much smaller number – an adjusted explained variance of less than nothing.
According to the Myers-Briggs, I am probably one of the most introverted people on earth. Although I loathe meetings, I spent decades as a consultant flying around the country, because my husband died, I had three small children and the money was good. That is just one random event, but here is what I have noticed …. for all of us, life is a whole series of random events strung together.
So, I don’t think we need to worry about the robots rising up against us just yet. On the other hand, maybe we ought to keep a closer eye on those sea monkeys.
Feb
4
Phi coefficients, odds ratios and the F-word
Filed Under Algebra, Dr. De Mars General Life Ramblings, Grantwriting, Software, Technology, The Julia Group, statistics | 1 Comment
Yes, I am the F-word – a feminist. I was at a faculty meeting this weekend and one of the presenters began by saying, pointing to a colleague in the audience,
“I am sure Dr. Y knows more about this than me.”
Several times in her presentation on analysis of assessment data she would pause and make comments such as,
“Well, I am not very good at statistics, but this is pretty easy to understand.”
I was a bit annoyed at her self-deprecating manner. I wanted to walk up to her and say,
“You understand this perfectly well and I know Dr. Y, who is very smart and competent, but no more so than you.”
Even more annoying was another presenter, also a woman, also very competent, who gave a very good presentation on assessment. Near the end of it, she said,
“You don’t have to use numbers. For those of you who don’t do math, you can put your students in categories as having exceeded criterion, met criterion or failed. You can just put it in bullet points.”
For those of you who don’t do math …. ????
What the hell? This is a university faculty meeting; 99% of the people in the room have graduate degrees and at least three-fourths of them have Ph.D.’s.
Since when has it become acceptable to not be competent, particularly in math??? Would that same presenter have started a sentence with,
“For those of you who can’t read, I have recorded this presentation as a podcast?”
There may be some people who can’t read because they are visually impaired or have a learning disability, but we consider this a disability, not a lifestyle choice.
This particular department is overwhelmingly female, and I could not help but wonder if the same sort of statements would be made in a predominantly male department? In my admittedly non-random and non-representative experience, the answer is, “No.”
So, first of all, for all of you women (and men), who say you aren’t good at math – cut it out! That’s a lot of nonsense that some people are naturally good at math and some aren’t. It’s a lot like swimming. You aren’t born knowing how to swim and, yes, very few people will become Olympic swimmers, but the vast majority of people can learn to dive in a pool and swim a few laps. It just takes time and effort to practice.
Let’s start with the phi coefficient. I blatantly stole this table from the Children’s Mercy Hospital website because I thought it was very well-explained and easy to understand – until I realized that it wasn’t and I only understood it because I already knew exactly how to calculate a phi coefficient. However, not one to let any act of larceny go to waste, I used it anyway.
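The formula for Phi is
Phi = (a*d – b*c) / √((a+b)(c+d)(a+c)(b+d))
where a and b are the two cells in the first row of the table and c and d are the two cells in the second row.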
Notice that Phi compares the product of the diagonal cells (a*d) to the product of the off-diagonal cells (b*c). The denominator is an adjustment that ensures that Phi is always between -1 and +1.
Let me explain this a little better. We have two categorical variables: gender, coded 1 = female, 2 = male, and “Did you eat today?”, coded 0 = no, 1 = yes.
In our table below, you can see that there is zero correlation between gender and if you ate today, as males and females are both equally likely to have had something to eat.
Gender \ Ate today?    NO    YES    TOTAL
Female                 10     90      100
Male                   10     90      100
Total                  20    180      200
When we subtract (10*90) – (10*90) — obviously, the numbers are the same, so we get zero. There is zero relationship. In the formula above, a, b, c & d are the numbers in each cell.
So, we have mathematically shown that there is no relationship between gender and whether one eats or not. Let’s try another question, “Did you do the dishes?” This time, we get the following results:
Gender \ Washed Dishes?    NO    YES    TOTAL
Female                     10     90      100
Male                       90     10      100
Total                     100    100      200
Let’s look at the phi coefficient again.
10*10 – 90*90 = 100 – 8100 = -8,000
100*100*100*100 = 100,000,000 and the square root of that is 10,000
So, our phi coefficient is -8,000 / 10,000 or -.80. That is a pretty high correlation, considering that the coefficient ranges from -1 to +1. A negative coefficient means that those who are lower on one variable (1 = female, 2 = male) are more likely to be higher on the other variable (0 = did not do the dishes, 1 = washed dishes).
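If you would rather let a computer do the multiplying, here is a minimal Python sketch of the same arithmetic. The function name and the cell layout (a and b in the first row, c and d in the second) are my own choices for illustration. The counts are the made-up ones from the two tables above.

```python
from math import sqrt


def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table laid out as [[a, b], [c, d]]."""
    numerator = a * d - b * c
    # Product of the two row totals and the two column totals.
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator


# Rows are Female, Male; columns are NO, YES.
ate_today = phi_coefficient(10, 90, 10, 90)
washed_dishes = phi_coefficient(10, 90, 90, 10)

print(ate_today)      # 0.0
print(washed_dishes)  # -0.8
```

Note that the function divides by zero if any row or column total is zero, which is the code's way of telling you that a variable with only one value cannot correlate with anything.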
So, our conclusion is that, while women are no more likely to eat each day than men, they are significantly more likely to do the dishes with data that I just made up to prove it. My daughter, Maria, tells me that any married woman knows that without the need for statistics.
Why did I just go into this in such detail, and all about one coefficient? Because I think a big part of the reason many people don’t learn math is that there are so many assumptions that we can “just skip over this”. In fact, the reason I liked the Mercy Hospital site is that it did not start out with n10n21 – n21n10 / √(n0+n1+n+1n+2)
and assume that everyone knew what marginal distributions and array subscripts meant, because, I can guarantee you, they don’t.
Sheila Tobias wrote a really interesting book about teaching and learning science, the title of which is “They’re not dumb, they’re different”.
Maybe, but I guarantee you that part of the problem is that they’re not clairvoyant. No one was born knowing that n10 means the number in the cell where the row value =1 and the column value = 0. It doesn’t help that at other times that same cell would be represented as n11 as the first row and first column.
If you can make that switch in your mind easily, it is no doubt because you, like me, have looked at thousands of matrices and had that notation explained to you so long ago that it is probably like learning to swim, you can’t even remember it. The secret to being good at math is the same as being good at swimming – practice!
Completely random fact – in my misspent youth, I was the first American to win the world championships in judo. If you type judo blog into Google, the first of 3,000,000+ pages that comes up is mine. And my most recent judo blog was on outliers and practice. Rather unusual when the two halves of my split personality come together.
As to odds ratios, I have more to say about those, but it is 1:30 a.m. and I have to get up in 7 1/2 hours to go to work, so that will have to wait until another day.

Jan
21
Completely Random
Filed Under Algebra, Dr. De Mars General Life Ramblings, The Julia Group, statistics | 2 Comments
I hate SQL. This is probably completely irrational, like that guy I turned down for a date in junior high school who my mom always tells me founded a very successful company and is making piles of money. No wait, it wasn’t irrational, he always tried to copy off me in Algebra, plus he was just plain boring. I think that is my problem with SQL, too, boredom. There is only so much left join, right join, outer, inner and dataset.variable I can tolerate before my brain tries to escape through my right ear just to get away from monotony. I have met people who love SAS. I have met people who love SPSS. I have even met people who love Stata. Nobody loves SQL. They are just with it for the money.
What practical use is Mokken’s H, really? Yes, it is true that the maximum phi is determined by the marginal distributions, and if you get a phi of .20, for certain distributions, that might be the maximum you can get, but so what? Maybe I was scarred in my youth by reading some of the articles on bias in mental testing where those who were so determined to prove that intelligence was genetic corrected correlations for attenuation, sometimes to as high as 1.20 and then averaged the corrected correlations!
From a purely theoretical standpoint now, it’s completely different. If you are interested in the analysis of binary data – and how could you not be – you’ll like this paper by David Armstrong, at the University of Oxford. I like it because he is very sensible. He doesn’t take a stance like “You should never use phi, never analyze bivariate data in a factor analysis,” etc. He takes a very measured view, which I like because really, so few things in the world are always true, except brain-dead obvious facts like you should not correct correlations to be above 1.0! (Clearly, I have still not gotten over that.) I have several SPSS workshops coming up. I think I will import the data from our evaluation of after-school programs to illustrate just how much the phi and tetrachoric coefficients move around when the marginal distributions change a lot. It’s a tough job, but somebody has to do it.
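To see how the marginal distributions cap phi, here is a minimal Python sketch. It builds the most lopsided 2x2 table the row and column totals allow and computes phi for it. The function names and the totals are hypothetical, chosen only for illustration, and this is not taken from the paper mentioned above. It relies on the fact that, with the totals fixed, a*d - b*c equals n*a minus the product of the first row and first column totals, so phi is largest when cell a is as big as the totals permit.

```python
from math import sqrt


def phi_from_counts(a, b, c, d):
    """Phi for a 2x2 table laid out as [[a, b], [c, d]]."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


def max_phi(row1_total, col1_total, n):
    """Largest phi possible for fixed marginal totals.

    Assumes both totals are strictly between 0 and n.
    """
    a = min(row1_total, col1_total)  # fill the top-left cell as far as it goes
    b = row1_total - a
    c = col1_total - a
    d = n - a - b - c
    return phi_from_counts(a, b, c, d)


# Hypothetical marginals for 100 people.
equal_margins = max_phi(50, 50, 100)   # 50/50 split on both variables
skewed_margins = max_phi(50, 10, 100)  # 50/50 on one, 10/90 on the other

print(equal_margins)             # 1.0
print(round(skewed_margins, 3))  # 0.333
```

With a 50/50 split on one variable and a 10/90 split on the other, the best phi you can possibly get is about .33, no matter how strongly the two variables are related.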
I can’t see a lot of people who are experienced SAS programmers switching to Enterprise Guide. Who I can see using it is people who use SQL, ACCESS, Excel or who are just starting to use statistics in their education or profession.
Hello, my name is Captain Obvious…. All that Data Step stuff you were missing and could not find in SAS Enterprise Guide? It was cleverly hidden in the menu under the word DATA.
“Must be a new meaning of the word ‘filter’ with which I was previously unfamiliar.”
Okay, maybe not so obvious is the fact that you need to go under the Data menu to Filter and Query to add two datasets together. I thought filter meant to hold back certain elements. Oh well, I guess it makes as much sense as going to the start menu to shut down your computer.
So, if you want to compute variables, recode variables, add tables or join tables, go to Data > Filter and Query.
Did I mention that I hate SQL ?
Jan
12
SAS was good today and so was the weather
Filed Under Grantwriting, Software, Technology, The Julia Group | 3 Comments

It was 85 degrees in soCal today. Just thought I would rub that in for the benefit of my former and present colleagues in the frozen north. The advantage of having a blog as opposed to doing an on-line course is that I can just randomly switch subjects, which in my office is referred to as “not being corporate”. Don’t know whether it is corporate or not but I can see that a lot of people are going to be loving SAS Enterprise Guide.

SAS Enterprise Guide
I have to confess that many years ago when Enterprise Guide first came out I thought it was one of the stupidest ideas I had ever heard. It was horrendously slow and about as intuitive as building a nuclear reactor out of wood. Have times ever changed! It is still slow and if you look at the log, the code it writes certainly isn’t what I would do. However, you can open SAS datasets with no problem. People who have no idea what a permanent library or LIBNAME statement is can now use SAS to make graphs, do principal components analysis and analyze subsets of their data. Imagine that Excel and Access had a baby, who then grew up and married the love child of SAS and some really cool graphics program that was not SAS/ Graph and didn’t suck. That, in a nutshell is SAS Enterprise Guide. It is full of surprises and almost all of them are pleasant. For example, today, I used the Send to on the File Menu to send a file to Microsoft Word and, surprisingly, it opened in Office 2007. Everything I had read said SAS and Office 2007 were not yet compatible, but that is not the case apparently. It has been hard for me to let go of coding everything because it is such a habit after twenty-six years, but I am trying very hard to put myself in the position of the people who will be taking the Enterprise Guide workshop next month and realize that, to them, typing:
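Libname in "c:\amsasex\project7\aimee" ;   /* point the libref IN at the project folder */
data in.disability_study ;                 /* create a new permanent dataset */
set in.fullsample ;                        /* read every record from the full sample */
if disability_status = "Y" ;               /* subsetting IF: keep only these records */
run ;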
is NOT the easy way. So, I have set myself the challenge of trying to use only Enterprise Guide to solve problems and not doing any programming. I have not succeeded at all, yet, by the way, but I am making progress. For example, I am starting to use the Filter and Query option from the data menu instead of those subsetting IF statements. It actually works just fine. In another post, I had talked about how people continue to use chi-square and ordinary least squares regression even when those are not appropriate at all for their data because they are familiar. I know I am in the same boat. Several times today, I exited Enterprise Guide, wrote the code in SAS 9.2 and ran it because I want to be able to look at my log and see what it does. Yes, you can look at the log in Enterprise Guide but the way the code is written is definitely not how I would have done it. In reality, the vast majority of people are very comfortable not knowing what goes on under the hood. How many people who use Word (including me) have the foggiest notion what the code looks like? Enterprise Guide can be a force for good or evil. It can allow researchers and executives more time to focus on how the sample was selected, the selection of the appropriate statistic and correct interpretation of the results. And it can be used by management-weenies and pointy-head boss wanna-bes to print out pretty pictures and tables with lots of numbers that they only pretend to understand.
My prediction, based on a random sample of zero, is that there will be a lot of both.

