In the shower this morning, I was thinking about how seldom true longitudinal designs meet the assumption of sphericity. This is the point where one of my daughters always makes a comment about what other people think about in the shower. I don’t want to hear it.
Sphericity is, loosely, an assumption that all correlations among the repeated measurements of the dependent variable in a multivariate design are equal. (Strictly, it requires that the variances of the differences between every pair of measurements be equal.) This is generally a stupid assumption for longitudinal designs because it is saying that the correlation between, say, IQ measured at age three years and four years is the same as between IQ measured at three years and six years of age. In other words, all correlations will be equal regardless of the time interval between them, whether the measurements were taken six months apart, a year apart or three years apart. Admit it, you agree with me and are now thinking to yourself,
“What kind of stupid assumption is that?”
Most statistical packages will test the sphericity assumption and then immediately reject it. Your output will also contain a useful few lines that give the mean square and significance with the degrees of freedom adjusted for the violation of the sphericity assumption. The Greenhouse-Geisser adjustment is a popular one.
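To make that adjustment a little less mysterious, here is a minimal Python sketch of the Greenhouse-Geisser epsilon, the factor the degrees of freedom get multiplied by. The function names and both covariance matrices are invented for illustration. They do not come from any real study, and a statistical package will do this for you.

```python
def gg_epsilon(cov):
    """Greenhouse-Geisser epsilon for a k x k covariance matrix of repeated measures."""
    k = len(cov)
    row_means = [sum(row) / k for row in cov]
    col_means = [sum(cov[i][j] for i in range(k)) / k for j in range(k)]
    grand_mean = sum(row_means) / k
    # Double-center the matrix: subtract row and column means, add back the grand mean.
    centered = [
        [cov[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(k)]
        for i in range(k)
    ]
    trace = sum(centered[i][i] for i in range(k))
    sum_sq = sum(value ** 2 for row in centered for value in row)
    return trace ** 2 / ((k - 1) * sum_sq)


def example_cov(k, corr_for_lag):
    """Unit-variance covariance matrix whose correlation depends only on the time lag."""
    return [
        [1.0 if i == j else corr_for_lag(abs(i - j)) for j in range(k)]
        for i in range(k)
    ]


# Hypothetical case 1: every pair of occasions correlates 0.5 (sphericity holds).
equal_corr = example_cov(4, lambda lag: 0.5)
# Hypothetical case 2: correlation fades as occasions get further apart (0.8 ** lag).
fading_corr = example_cov(4, lambda lag: 0.8 ** lag)

eps_equal = gg_epsilon(equal_corr)
eps_fading = gg_epsilon(fading_corr)
print(f"equal correlations:  epsilon = {eps_equal:.3f}")
print(f"fading correlations: epsilon = {eps_fading:.3f}")
```

Epsilon is 1 when sphericity holds and falls toward 1/(k - 1) as the violation gets worse. The adjusted test multiplies both the numerator and denominator degrees of freedom by epsilon, which is why the longitudinal-looking matrix pays a penalty and the equal-correlation one does not.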
I’m in a rush today so if you want to read more about this, the absolute best chapter I have ever read on MANOVA with repeated measures is in a wonderful book called Applied Longitudinal Data Analysis for Epidemiology by Jos Twisk. No, it is not the only chapter on MANOVA with repeated measures that I have ever read. Don’t be a smart ass.
I got this book from the central Los Angeles Public Library, which is an amazing place. It is six stories of books – an entire floor on science, a floor on business and technology, another floor for literature, a floor for children and teen books.
It’s beautiful in fact, intent and effect. Every time I go there it restores my faith in humanity. Not only is there an enormous amount of human knowledge stored here free to anyone who walks through the door, but it is also supported by public funds and private donations. Everything in the library is tangible proof that some people cared about society enough to spend their time and energy on making this place possible rather than on buying another iPod or cafe mocha for themselves. Philanthropy is not a dirty word and caring about making the world a better place, yes, even through your taxes, does not make you a communist. And I pay one hell of a lot of taxes. More than General Electric, I can tell you that.
The library is no longer open seven days a week. Due to budget cuts, they now need to close on Sundays and Mondays. I used to buy the argument that we cannot have the library open seven days a week because it is just plain math. We are broke. We keep hearing that, the state is broke, the county is broke, the city is broke. If we take in $X per year and running services the way they are costs $(X +Y) then we have to cut services. It’s basic math.
But there is that assumption again. The unstated assumption is that we will get no more income and that there is no more income to be had. Michael Moore made an insightful comment on the Colbert Report the other night about the trillions of dollars owned by the richest Americans. Before you start screaming that we are punishing the successful, I’d remind you that managers who ran companies that lost billions of dollars received a government bailout and then hundreds of millions of dollars in bonuses. I’m not sure what definition of successful includes having your company lose billions of dollars.
We keep getting told that if we do tax these extraordinarily wealthy individuals and their corporations, it will cost jobs. Actually, it seems to me that those corporations are already creating jobs in India, China and other countries rather than the U.S. – Andy Grove, Intel CEO, noticed the same thing. (Lois Kazakoff has a great post on this topic and Grove’s ideas.)
Just like the sphericity assumption is stupid if you stop and think about it, it is just as stupid if you assume that just because some people are making hundreds of millions of dollars that they are the best, brightest and smartest in America and that we will fail and become bankrupt as a society if we make them pay more taxes than they pay now. In fact, the top 1% of earners paid a larger percentage of their income in taxes and there were greater taxes on wealth over thirty years ago and our budget was more balanced and unemployment was lower. Could it possibly be that part of why they are so rich is that they have the wealth and power to change the tax laws and other laws and regulations to benefit themselves?
If we lay off people who work for the government, employment will be immediately lower (as those people lose their jobs). The assumption is that corporations will see their taxes being lowered and then hire more people so employment will eventually go up. That assumption has not been borne out by the facts. As Jon Stewart noted, not only did General Electric not pay any taxes but they received billions in tax credits, which amounts to a negative tax rate. If you don’t think Jon Stewart is a credible enough source, you could check out articles other places like the New York Times or CNN.
The assumption we are asked to make is that GE would not be globally competitive if they had to pay taxes. General Electric reported a $5.1 billion profit on American operations last year (let’s assume that is fairly reported, another questionable assumption). They paid taxes on none of it.
I paid taxes last year and I kept working, in America, even. I’d like to try a different assumption, and that is that if we closed the loopholes and made GE and its executives and largest shareholders pay the same tax rate as I pay that we’d be able to keep the library open.
Maybe more people would come there and learn about things like applied longitudinal data analysis, apply that knowledge to developing innovative technology and create jobs.
Let’s try that assumption for a while, because the current one isn’t working.
If you are the right age to have watched re-runs of the show on Nickelodeon, Clarissa Explains It All, then you are the age group today’s blog was written for. And don’t tell me the previous statement is grammatically incorrect. After having looked at these results, I’m already pissed off (note to self: don’t swear in front of children)
Look at this next graph closely. It can answer several questions for you, including does education matter (yes, to an extent) and does it overcome the effects of race (some, but not completely).
Notice those lines I drew on the chart. What they show is an effect of completion of education. For every group, there is an increase in mean income at the point where you get a diploma or degree. There is a slight bump up when you finish high school, another jump when you finish a college degree, and the biggest jump when you finish a graduate degree, either an MD/JD or Ph.D. You can’t really see it on the chart (which is why you want to look at the tables as well) but at each point where you FINISH something there is a big jump. So, for example, going from three years of college to a college degree is a much bigger increase in salary than going from no college to one year of college.
Why the drop down at the end? That is because people who have an MD or a law degree on the average made a little more than people with a Ph.D. Both groups, though, made a lot more money than people with just a college degree.
So, first lesson, if you want to make a lot of money in America, your best chance is to go to college for at least 8 years. Remember that whole thing about mean and median, though? Here is the MEDIAN income by education.
So, is it true that if you just worked hard and got an education you’d be successful? To a point, yes. But that means you have to graduate from high school, be able to afford to go to college, graduate from college with good enough grades to get into graduate school, be able to afford to go to graduate school and graduate with an MD or a PhD.
Anyone can do that – there’s certainly no law against anyone applying, and it’s against the law to just say, “Hey, we don’t accept black people in medical school”, unlike, say, when my grandfather went to medical school in the 1920s.
HOWEVER …. if your parents have money, it is a whole lot easier for you to afford college and to afford not to work during college. If you don’t have to work full-time and can just study, if you don’t have to wait until three weeks into the semester when you’ve earned enough money to buy your books, then it is going to be easier for you to get good grades, get into college and stay in college. If your parents went to college themselves, they are going to be better able to help you with everything from your schoolwork in high school to filling out your college applications.
I am NOT saying this to discourage anyone, in fact, the opposite. I am saying it because when you are in one of those accelerated high school programs or when you are in college, and it seems like making the grades is easier for other people than for you, you are right. If you find yourself asking what’s wrong with you, the answer is – there’s not a god damn thing wrong with you (note to self: Don’t swear in front of the children).
We just saw that with every diploma or degree you get, your likely income goes up A LOT, even with the fact that income is very skewed.
Now, here is another chart from the census data. This one bothers me even more than the others.
In 2009, over half of the Hispanic adults in California had not graduated from high school. This compares to about 20% of the white, non-Hispanic population. The exact figures are 59% and 22%.
Does this mean that fewer people are working? Well, it is true that there is much higher unemployment among people with less education. However, it is not nearly enough to explain the differences in income.
64% of Hispanics are employed, compared to 70% of non-Hispanic whites. That’s a difference but it’s not nearly enough to explain the difference in the chart we saw yesterday which showed that whites have a median income more than DOUBLE that of Hispanics.
So, we see that most Hispanics are working. In fact, the percentage of Hispanics working is just a little below the average for the state of California, which is 67%. Those numbers, by the way, only include people who are working full-time. What happened to that thing about people are poor because they’re lazy and not working hard?
Could it be that those Hispanics who have less than a high school education are all illegal immigrants? That’s the term we kept hearing during the last election, right? The reason our state is having problems is because of all of those people here illegally.
Well, I ran those statistics just on citizens. Before I ran the analyses, before I looked at the results, I had to think about this. I’m Latina. What if it turns out that is the truth? What if the data show that the real problem is we have all of these non-citizens here and they are bringing down our average educational level, our median income, everything … What would you do?
A very wise man told me once,
“The data show what they show.”
I found some mistakes in a study he had done, and he was pretty well-known for his research, and I was just nobody. When I showed him what I had found, he helped me write up my study and get it published. His point was that if you are a real scientist, you have to do the best you can to find the truth and then you state that truth as simply as you can, whether it turns out to be what you hoped to find or not. Just so you know, NOT all scientists are like that, but they should be.
Here’s what I found:
It turns out that if you only include citizens, it makes very little difference. The reason we have income inequality in this country is NOT because we have a whole lot of undocumented people from Mexico. I always suspected that was bullshit (note to self: remember not to swear in front of the children). In fact, including only citizens lowers the percentage of Hispanic adults without a high school education only from 59% to 57%.
So, why did I come to tell you all of this depressing stuff? For one reason, the truth is always better than a lie. If you KNOW that the chances are you won’t make much money, even if you work full time and even if you are a citizen, unless you get an education, then you damn well better get one.
High school is FREE! If you absolutely hate the high school you have to attend, see if you can get into a charter school, a magnet school, a scholarship to a private school – they DO give them. If none of that works and you drop out even though I TOLD you not to, get your GED or go to an alternative school.
You can even go to a community college without a high school diploma. Please don’t waste tens of thousands of dollars going to some place like the University of Phoenix. You can go to Los Angeles City College or Santa Monica College for about $600 a semester in fees and if your family doesn’t have a lot of money you can usually get the tuition waived.
When you get that education, not only will you be making more money but you will be in a position to do something about the way income is distributed in this country.
– unless you’re okay with it being like this?
First you understand the way it is – and then you change it. There was a time when no one could imagine there being a black president. There was a time when universities turned down women, African-Americans, Jews, Asian-Americans for medical school just because they weren’t white and male. It isn’t that way any more.
In fact, over the past thirty years, income has changed to be less equal in America. It can change back.
You don’t have to accept the world being the way it is.
(Oh, by the way, one million brownie points to the teacher who knew me but invited me anyway.)
Given that the students are 12 – 14 years old, I did not want to make the analyses too complicated but I wanted to hit some basic points.
First, that statistics can tell you the truth about the world, truths these kids already know because of the environment in which they live, but which are often glossed over in the news.
Second, you need to look past the simple answers. Statistics involves asking questions about your data. For example, if you look at what percentage of people in California have jobs it’s about 56%, but that includes newborn babies and people over 100 years old. To get better answers, you should look at the people of “working age”. Who gets to say what ‘working age’ is? If you say that it is over 20, because people are in college, that gives you a different answer than if you say it is over 17 because most people have either dropped out or graduated from high school and are looking for a job by 18.
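Here is a minimal Python sketch of how much the definition of "working age" moves the answer. The fourteen people below are completely made up to show the mechanics. They are not census records, and the percentages they produce mean nothing beyond this example.

```python
# Hypothetical people as (age, has_job); invented purely to show the effect of the cutoff.
people = [
    (1, False), (9, False), (16, False), (18, True), (19, False),
    (20, True), (25, True), (34, True), (41, False), (52, True),
    (60, True), (67, False), (83, False), (101, False),
]


def percent_employed(records, min_age=0, max_age=200):
    """Percent with a job among people whose age falls in [min_age, max_age]."""
    in_range = [has_job for age, has_job in records if min_age <= age <= max_age]
    return 100 * sum(in_range) / len(in_range)


everyone = percent_employed(people)             # babies and centenarians included
age_18_to_62 = percent_employed(people, 18, 62)
age_21_to_62 = percent_employed(people, 21, 62)

print(f"everyone: {everyone:.0f}%")
print(f"ages 18-62: {age_18_to_62:.0f}%")
print(f"ages 21-62: {age_21_to_62:.0f}%")
```

Same people, same jobs, three different answers. The only thing that changed is who was allowed into the denominator.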
Third, America is not nearly as equal as we like to pretend. Income is radically skewed and almost everything – income, education, employment – varies by race and gender.
Everybody is not equal – not even close
To understand income in America, you need to know the difference between the mean and the median, a difference that turns out to be very, very important in understanding income. The mean is the average. To get it, you add up the total of something, say, how much money each person makes, and divide by the number of things (or, in this case, people) that you collected data on. So, if we have nineteen unemployed people, all making zero dollars, and someone who is making $20 million then the mean is the total of how much everybody altogether was making divided by the number of people, or $20 million divided by twenty. The mean is $1 million. Now try telling those 19 unemployed people that everything is great because the average person in the room is making a million dollars a year. How many of them would slap you at that point is a statistic I don’t have, but it would be interesting to know.
The MEDIAN is the middle of a distribution, that’s the point that half of the people are higher and half are lower. For our group of 20 people, the median income is zero dollars. When a distribution is very skewed – that is bunched up on one end and then going out very, very far to the other end, using the median gives you a lot better picture of what it’s like for most people than the mean.
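If you want to check the room of twenty people for yourself, a few lines of Python will do it. This is just the example from the text typed in, using the standard `statistics` module; the variable names are mine.

```python
from statistics import mean, median

# Nineteen people making nothing and one person making $20 million a year.
incomes = [0] * 19 + [20_000_000]

mean_income = mean(incomes)      # total income divided by the number of people
median_income = median(incomes)  # the middle of the sorted list

print(f"mean:   ${mean_income:,.0f}")
print(f"median: ${median_income:,.0f}")
```

One very rich person drags the mean up to a million dollars, while the median stays at what the typical person in the room actually makes.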
Let’s look at the distribution of income:
- Distribution of Personal Income in California
Of the working age population, which is people from age 18 to 62 by my definition, 51% in California make $20,000 a year or less. According to a couple of sources I found, the highest paid 400 people in the country earned over $300,000,000 a year. About 10% of the wealthiest people in the country live in California, so I am assuming about 10% of the highest earners do as well. Certainly, some of them do.
So, the highest paid 40 (or so) people in the state each make in one year what the bottom half of the people would make if they worked for FIFTEEN THOUSAND YEARS!
The wealthiest person in California is Larry Ellison, CEO of Oracle and reported to be worth $22.5 billion. Now, this is the amount of money he HAS, not how much he earns in a year. For that bottom half of the people, they would need to work 1,125,000 years to make that much money. Are you getting the idea yet that just working overtime isn’t going to get you there? Of course, the way you get $22 billion isn’t by working for over a million years. You get some money and then you invest it in factories, buildings, hire some workers, invest in bonds, stuff like that. Have you heard that saying that the rich get richer and the poor get poorer? Well, the fact is, they do.
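The "years of work" figures are simple division, sketched here in Python with the numbers quoted above. The one assumption is that the person earns exactly $20,000 every year, the top of the bottom half, and never spends a dollar of it.

```python
median_income = 20_000              # dollars per year, ages 18-62, from the text
top_earner_income = 300_000_000     # dollars per year cited for the highest-paid people
ellison_net_worth = 22_500_000_000  # dollars, reported net worth

# Years of $20,000 paychecks needed to match each figure.
years_to_match_top_earner = top_earner_income / median_income
years_to_match_ellison = ellison_net_worth / median_income

print(f"{years_to_match_top_earner:,.0f} years to match one year of a top earner's pay")
print(f"{years_to_match_ellison:,.0f} years to match Ellison's net worth")
```

Anyone making less than $20,000, which is most of the bottom half, would need even longer.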
Median income is the point where half the people make more and half make less. For California, median income for people from 18-62 years old is $20,000. Remember our nineteen unemployed people and the one guy with $20 million a year? This is starting to sound like that, isn’t it?
Here are two interesting facts I found on the Internet,
- The Census Bureau says, “Generally, the long-term trend has been toward increasing income inequality,” and
- According to Professor Domhoff, most people in America are very far off in thinking how equal America is.
I was talking to my friend today – that’s him on the left. Like Larry Ellison, he’s president of a company he started. He isn’t worth $22 billion, though. He teaches about ethics (think of it like church without mentioning God), and he said he was confused that even though he saw lots of people who didn’t have the values he teaches about like honesty, courage and generosity, that still America is the number one country in the world so maybe those values aren’t so important. He asked how I explained that. I asked him what he meant by America is the greatest country in the world and he was surprised.
“You know, we have the most freedom, equality, democracy.”
In fact, when it comes to income equality, we are not even close to number one. We’re number 93.
So … why is it so unequal?
Race is related to income – by a lot
The graph above shows average personal income by race for California. Yes, the rate for whites is more than double that for Hispanics. Maybe, though, this is due to a few really, really rich people. After all, Larry Ellison is white. Maybe it’s all his fault.
No, that’s not it. We do see a couple of really interesting things here. First, for EVERY race, the median is a lot less than the mean. When you see a big difference between the mean and median like that, it usually means the distribution is skewed, that is, there are a lot of people bunched at one end. Okay, we already knew that. This tells us though, that the distribution is skewed for every race. It also says that the difference between races is NOT due to just a few really rich people. Larry is off the hook.
Perhaps the difference is due to age. People tend to make more money as they get older, and then less as they retire. (This is called a curvilinear relationship). The graph below shows the relationship between age and income by race.
The Difference in Income Between Races is not Due to Age
You can see that for all races income goes up from about 16-30. For Asians and non-Hispanic whites it goes up at a steeper rate and for a longer period. So, no, the difference is not due to Hispanics being younger (although it happens that they are). Hispanics, and, to a lesser extent, blacks, make less money at all ages.
So … it isn’t age, it isn’t due to just a few really rich people. Why is there such a difference between races? Is it all explained by education? How did Hispanic get to be a race? Doesn’t the census say that Hispanics can be of any race? And who are those “other” people anyway?
All good questions, but they will have to wait until the next blog because it is close to midnight and I can hear very loud music upstairs from the bedroom where my own little seventh-grader is very obviously not asleep. I must now transform from evil statistician to evil mother.
I was very thrilled to be invited to speak to six classes of seventh- and eighth-grade students at an urban school. Actually, they wanted me to speak to seven classes but there is no way on earth I am getting up at 6:30 a.m. or whatever ungodly hour would be required for me to make it to an 8 a.m. class.
These students live in an area where basically everything you want to be low is high – poverty, crime, unemployment - and everything you want to be high is low – education, income, fluency in English. I spoke to a teacher at a similar school and she said her students were very interested in issues of race and inequality. In her words,
“My students aren’t stupid. They’re getting screwed in America and they know it. There just isn’t anything they can do about it because they’re all, like, thirteen years old.”
Failure #1: Summary Tables
My initial thought was to download the summary tables from the census site, read those into a JMP file and create the graphics from there.
I tried downloading some summary tables of those characteristics but concluded after a day of messing with trying to get the data into the format I wanted that it would be far easier to do it in SAS. Now, if you don’t know SAS, it probably would take you more time to learn it than to just go ahead and use JMP but, hey, at the end you’d know SAS. (Note to self: Learn more about JSL).
Everyone is complaining about the price of SAS products and I have always been at a university or corporation that paid for the license. So, I thought I would actually check the price, and holy shit, this stuff is expensive. I thought perhaps I should see what I could do in OpenOffice spreadsheet on Unix just in case I run out of clients or employers with licenses and have to pay for the software myself. Unfortunately, the spreadsheet application has a limit of 65,000 rows. Also, it occurred to me that if I didn’t have any clients, I wouldn’t have that much need for the software, now would I?
Anyway … the whole summary tables thing didn’t work out because I wanted variables defined more simply than the Census Bureau did, because this is for a middle school class. For example, I wanted two categories, employed and unemployed, rather than the six the census uses.
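The recode itself is just a lookup table. Below is a minimal sketch in Python rather than SAS. The six labels are my reading of the census employment status categories and should be checked against the PUMS data dictionary, and the choice to drop "Not in labor force" instead of forcing it into one of the two groups is an assumption of this sketch.

```python
from collections import Counter

# Assumed six census categories mapped to two simple ones (None = left out).
RECODE = {
    "Civilian employed, at work": "employed",
    "Civilian employed, with a job but not at work": "employed",
    "Armed forces, at work": "employed",
    "Armed forces, with a job but not at work": "employed",
    "Unemployed": "unemployed",
    "Not in labor force": None,
}


def simplify_status(census_label):
    """Collapse a detailed status label; blank or unknown labels return None."""
    return RECODE.get(census_label)


# Hypothetical responses, only to show the function in use.
sample = ["Civilian employed, at work", "Unemployed", "Not in labor force", ""]
simplified = [simplify_status(label) for label in sample]
counts = Counter(status for status in simplified if status is not None)
print(counts)
```

Keeping the mapping in one place makes it easy to change your mind later about where a category belongs without touching the rest of the analysis.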
Failure #2: JMP
I downloaded the PUMS (as in Public Use Microdata Set) from the American Community Survey for California. This is a 1% sample of the state – 352,875 people, to be exact. You can download it as a SAS dataset, which I would recommend.
My initial thought was to download it, do a few data manipulations with SAS, then output it to a JMP file and create the graphics from there. I like JMP in part because it does good graphics easier than SAS and because it runs native on Mac OS. Using Windows requires moving 45 feet to a computer downstairs or waiting approximately 15 seconds for a virtual machine to open, thus negatively impacting my quality of life by requiring movement or waiting. I am an American, after all.
Hold that thought – I still think SAS to JMP is a good idea but there turns out to be a bit more massaging of the data necessary than I had originally planned.
Success? Data Massaging with SAS
As I mentioned previously, finding how many people are unemployed was not a simple matter of looking at how many people said they were unemployed, although that would certainly seem like a reasonable way to do it. It also seems reasonable to raise taxes on people and corporations making over $10 million a year and fund health care, education and fire departments, but we don’t see that happening, now, do we?
My first thought was to create a dataset that was a subset of the variables I needed in SAS, do a little bit of recoding and then run the graphs using Enterprise Guide. This did not completely work out.
In trying to find a way to use a weight variable in graphs for SAS Enterprise Guide (which I never figured out in the twenty minutes I spent on it), I came across this Freakalytics blog on The Joy of SAS Enterprise Guide. While I agreed with most of his points, one point made by a dissenter in the comment section I have to agree with – SAS Enterprise Guide IS slow. I’m running it under VMware on a Mac with 2GB allocated to the virtual machine and I still have something up in the background like a bid I’m working on, or am answering email while I wait for the next step to pop up. How annoying it is to you no doubt depends both on your work habits and patience. Since I probably think 60 times a day, “Well, what about this?” waiting a minute for each analysis takes an hour of my time each day. Yes, I do remember when we would submit jobs to run overnight and pick up our output on that green and white lined computer paper the next morning. No, I am sure the fact that I was using a dataset with 325,000 or so records didn’t help. Your point being?
I ended up doing some recoding of the data in SAS 9.2, then opened the dataset I saved, with just the variables of interest, in SAS Enterprise Guide. Even though I thought I had done the recoding I wanted in SAS, I still ended up at several steps creating a filter to, for example, only include the working age population, or recoding race to drop the “other” category.
Again, a lot of this was due to my need to reduce the categories to better fit the level of the group to whom I was presenting. Finally, I lumped all the things I needed to recode, subset, etc. into one program and ran that in Enterprise Guide, then went ahead with my analyses.
Three points regarding Enterprise Guide.
First, I had thought maybe I could just take 10% or so of the 325,000 and use Excel or OpenOffice. If it was a random sample, it should do almost as well. One reason I am very glad I did not choose that route is the MODIFY TASK option in Enterprise Guide. I am constantly wanting to look at results in a different way, and this allows me to do that without starting all over. Inhabiting some parallel universe where this looked like a good idea, Microsoft has made it so the latest version of Office for the Mac doesn’t allow you to record macros. How much does that blow? Answer: A lot.
Second, I think the SUMMARY TABLES option is much better than TABLE ANALYSIS for the types of things I needed to do. It just allows a lot more flexibility.
Third, since I didn’t see any way to use a weight variable and then get a percentage in a pie chart, I ended up doing summary tables, outputting the results to a dataset and then analyzing that dataset. I did compare weighted and unweighted results and it really did not make that much difference.
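For anyone curious what "weighted" means here, this is a minimal Python sketch of a weighted versus unweighted percentage. The four rows and their weights are invented and deliberately lopsided so the two answers differ; on the real file, as noted above, they barely did.

```python
from collections import defaultdict

# Hypothetical survey rows: (category, weight), where the weight is the number
# of people in the population that one respondent stands in for.
rows = [("employed", 90), ("employed", 110), ("unemployed", 150), ("employed", 50)]


def percentages(records, weighted=True):
    """Percent in each category, counting each row by its weight or simply as 1."""
    totals = defaultdict(float)
    for category, weight in records:
        totals[category] += weight if weighted else 1
    grand_total = sum(totals.values())
    return {category: 100 * total / grand_total for category, total in totals.items()}


weighted_pct = percentages(rows, weighted=True)
unweighted_pct = percentages(rows, weighted=False)
print("weighted:  ", weighted_pct)
print("unweighted:", unweighted_pct)
```

Writing the percentages out to their own small table first, and charting that, is the same workaround described above: the chart never needs to know about the weights.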
Once I started getting results, that was both the fun and the depressing part. Actually doing the statistics is the fun part, but the results were not what I would have if I ran the world. Several times, I have re-run results, or compared them with census data for the state or nation thinking this can’t possibly be right – but it is.
Below is the distribution of income. In California, median personal income, that is, the income for the median person surveyed, was under $24,000 per year. I looked it up in a frequency table to get a more precise estimate and the amount was $20,000. The census gives a more cheery $26,000 or so. The difference is due to the fact that their estimate is based on “full-time year-round workers” whereas mine was based on “people of working age”, which I defined as 18 – 62.
For people who want to believe we have a fairly egalitarian society, this is a pretty depressing chart. And it gets worse …
You know those people who have perfect theories about raising children and then they give birth to an actual child who (surprise, surprise) progresses from throwing up baby food on the cat to staying out after curfew? Well, the open data initiative seems to be like that. The people who are full of hype about the 300,000 plus datasets released by the federal government alone, not to mention data from other countries – apps galore are going to come out and make brilliant people millionaires while solving all the nation’s problems – they don’t seem to actually have a lot of experience with it.
Now, I love the idea of open data. Not nearly as much as I love my children, but still, I think it is amazingly cool. The applications produced by the federal government are pretty cool, too. ArcExplorer2 is a handy-dandy FREE (as in free beer, not free puppy) application I used for analyzing FCC data. Not to be topped, the Census Bureau offers DataFerrett. If you don’t want to do your own analyses, you can download this nifty little critter and be producing analyses in next to no time. So, I did.
You can also, if you want to do analyses on the actual data, download the PUMS which is the public use microdata. I downloaded a SAS datafile that was 1% sample for the state of California, which is around 320,000 people. That part was a piece of cake.
Then, I read the documentation to figure out what those 80 weight variables were for, whether agep was something different than age (it isn’t), what PINCP was (person’s income from the person record). I wrote some formats because I wanted the results to print, for example, “State Government Employee” rather than “4” to label each row.
That being done, I ran some statistics of my own and compared them against the DataFerrett results, being careful that both selected only California as the state and both used the proper weight variable. The two sets of results came out identical. What could be better? Except…. (you knew it couldn’t be that easy) ….
For example, the data show that only 293,096 people in California are unemployed, out of 36 million. Hallelujah, recession over! Both DataFerrett and my little SAS program gave the same result. This is based on the “Class of Worker” variable, which asked people where they worked – business, non-profit, state government, self-employed and so on. Fewer than 2% said they were unemployed BUT this variable was blank for nearly a quarter of those surveyed. I’m going to guess some were unemployed, some retired and some just felt it was none of your damn business.
I called the Census help desk for PUMS and they were quite helpful. In fact, I don’t know why people constantly bash government employees. They generally seem to do a good job. When I asked why such a glaring discrepancy with the economic data one normally reads in the paper, the gentleman responded,
We record what the respondents tell us, which is a very different method from how many of those indicators are gathered, say, using unemployment claims. People don’t like to say they’re unemployed.
So, presumably, they leave the question blank or provide information on the last job in which they were employed. Fair enough. My point, though, is that if you were going to just do a frequency distribution and look at the percentage who said they were unemployed to get a percentage of people in California who are unemployed (which sounds like a pretty reasonable thing to do), you would be FAR off.
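To see just how far off, here is the naive calculation as a short Python sketch using the figures quoted above. Treating "nearly a quarter" blank as exactly 25% is my rounding, so the second percentage is approximate.

```python
unemployed_by_class_of_worker = 293_096  # people coded "unemployed", from the text
population = 36_000_000                  # "out of 36 million"
share_blank = 0.25                       # "nearly a quarter", rounded: an assumption

# Naive rate with everyone in the denominator, blanks included.
pct_of_everyone = 100 * unemployed_by_class_of_worker / population
# Naive rate among only the people who answered the question.
answered = population * (1 - share_blank)
pct_of_answered = 100 * unemployed_by_class_of_worker / answered

print(f"of everyone:          {pct_of_everyone:.2f}%")
print(f"of those who answered: {pct_of_answered:.2f}%")
```

Either way the answer is around one percent, and throwing out the blanks does not rescue it. The problem is the question, not the denominator.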
The government, it turns out, is smarter than you think. Since people don’t volunteer they are unemployed when asked what sector they work in, there is also a question on whether they worked in the last week, counting paid vacation. If you look at this question, you’ll find that 23% of the people said they did not work in the previous week. If you drop out people over 62, presumably some of whom are retired, it drops to 22%. So, is the unemployment rate the less than 2% who gave the sector in which they worked as “Unemployed” or is it the 22% who didn’t work for pay last week?
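To make the gap between those two numbers concrete, here is a minimal Python sketch of a weighted frequency. Everything in it is made up for illustration: the records, the field names (`class_of_worker`, `worked_last_week`, `weight`) and the numbers are assumptions, not the actual PUMS variable names, not my SAS program and not the California results.

```python
def weighted_share(records, predicate, weight_key="weight"):
    """Share of the weighted total for which predicate(record) is true."""
    total = sum(r[weight_key] for r in records)
    hits = sum(r[weight_key] for r in records if predicate(r))
    return hits / total

# Hypothetical survey records; each weight says how many people a row represents.
records = [
    {"class_of_worker": "private", "worked_last_week": True, "weight": 120},
    {"class_of_worker": "state_gov", "worked_last_week": True, "weight": 80},
    {"class_of_worker": None, "worked_last_week": False, "weight": 100},       # left blank
    {"class_of_worker": "private", "worked_last_week": False, "weight": 60},   # reported last job
    {"class_of_worker": "unemployed", "worked_last_week": False, "weight": 10},
    {"class_of_worker": "self_employed", "worked_last_week": True, "weight": 130},
]

said_unemployed = weighted_share(records, lambda r: r["class_of_worker"] == "unemployed")
left_blank = weighted_share(records, lambda r: r["class_of_worker"] is None)
did_not_work = weighted_share(records, lambda r: not r["worked_last_week"])

print(f"Said 'unemployed':       {said_unemployed:.0%}")
print(f"Left the question blank: {left_blank:.0%}")
print(f"Did not work last week:  {did_not_work:.0%}")
```

Same people, same weights, and the "unemployment" figure depends entirely on which question you count.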
By this point I have been looking at the PUMS data for two days. I have downloaded the data dictionary, the actual survey questions, the coding for the questions, the data itself and a couple of tutorials. Even with all of that, I haven’t found where they actually explain how they coded ESR which is the “Employment Status Recoded” variable.
What I do find, running my SAS PROC FREQ is that 99.9% of those shown as employed in the civilian or military labor force reported working last week and 99.9% of those who were coded as unemployed or not in the labor force did NOT report working last week. Now, that’s what I’m talking about – the people who are unemployed should not be in paid employment (that being the definition of unemployment and all).
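PROC FREQ did the work for me, but the same sort of consistency check can be sketched in a few lines of Python. This is a rough, unweighted illustration under my own assumptions: the category labels and the toy counts are hypothetical and are not the ESR codes or the actual survey results.

```python
from collections import Counter

def row_percentages(pairs):
    """Cross-tabulate (row, column) pairs and return each cell as a percent of its row."""
    counts = Counter(pairs)
    row_totals = Counter()
    for (row, _col), n in counts.items():
        row_totals[row] += n
    return {
        (row, col): 100.0 * n / row_totals[row]
        for (row, col), n in counts.items()
    }

# Hypothetical (recoded status, worked-last-week answer) pairs, one per respondent.
pairs = (
    [("employed", "worked")] * 9
    + [("employed", "did_not_work")] * 1
    + [("not_employed", "did_not_work")] * 4
    + [("not_employed", "worked")] * 1
)

table = row_percentages(pairs)
for (status, answer), pct in sorted(table.items()):
    print(f"{status:>12} / {answer:<12} {pct:5.1f}%")
```

If the recoded variable is doing its job, nearly all of each row should land in the cell you expect, which is what the 99.9% figures were telling me.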
Based on this final variable, 11% of the people surveyed in 2009 were unemployed with another 2.5% who reported that they had a job but did not work in the past week. I presume these are people in seasonal work like farming, or jobs like firefighters where you don’t necessarily work every week.
Bottom line: There are a lot of great resources available for analysis of public data, including data sets, codebooks and applications. However, this is not Google. It’s not even Wolfram Alpha (which I have found very disappointing, but that’s another post). Finding answers for yourself using even very well documented datasets is a lot of work and, unlike raising children, I can’t imagine the average person is going to have the time or inclination to do it. I mean, at least with children, you get to have sex a few times at the beginning and I don’t see the Census Bureau offering anything nearly equivalent as an enticement.
Does this mean it’s a bad idea? Not at all, and I do think, with more encouragement, a great deal can be learned. I chose this dataset because I am giving a presentation for middle school students in a very low income neighborhood and I wanted to show them how they can use data to analyze their lives – the poverty rates by race, education and age, the distribution of income, the disparities in unemployment and more. I think it will be fun and interesting because I DON’T work for the government so I don’t have to be politically correct. We can sort the data by income and go backwards down the list of the highest earners until we find the highest-paid Latino in the sample. We can look at what percentage of people earning $X are African-American females, see how skewed the distribution of income is and the huge difference between the mean earnings and median earnings.
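For the mean versus median point, a tiny Python sketch shows why the two can be so far apart in a skewed distribution. The incomes below are invented for the example and have nothing to do with the actual PUMS sample.

```python
import statistics

# Hypothetical annual incomes: six ordinary earners and one very high earner.
incomes = [18_000, 22_000, 25_000, 31_000, 40_000, 52_000, 1_000_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(f"Mean:   ${mean_income:,.0f}")
print(f"Median: ${median_income:,.0f}")
```

One person at the top drags the mean far above what the typical person in the list earns, while the median stays put at the middle value.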
My points, and I have two, are:
- There really IS potential for ‘crowd-sourcing’ the analysis of open data, if people who need to use data anyway, e.g., for teaching, class presentations, research papers, choose to use THESE data.
- The useful work done by the people in the first point could be greatly extended by curation – having a repository where the results are posted, categorized, critiqued and edited.
Unfortunately, I don’t see either of these things happening much at the moment.
Yesterday, Matt Keranen (a.k.a. @HybridDBA) commented on SAS giving free software to universities as a response to R: “Would be nice if they did the same for developers.”
The more I thought about that, the more convinced I became that he is right. There are a few reasons. First, let’s look at the two reasons that SAS has for giving free access to universities:
1. What you learn in college is probably what you are going to use when you graduate, if you have any choice. At large institutions (companies, universities) the decision to continue paying for a site license is related to how many people use it. If a person comes into an organization and is told they can use SAS, Stata, SPSS or R and they know SAS, that’s what they are going to choose. When the decision to renew the license comes up, the cost in the budget will be balanced against the fact that there are 1,478 users who would be inconvenienced if they were forced to switch packages, not to mention the cost of all of that code being rewritten, documentation, etc.
2. SAS really isn’t losing sales by giving it away free. The version SAS distributes free is limited in the number of observations, and runs on the SAS servers, which are slower than running on your desktop. Someone doing heavy-duty analyses at the university is going to need the full edition and pay for it. Students, who are broke, usually working for minimum wage, if they have a job at all, and just doing small scale studies for class assignments, might have turned to R if they had to shell out a lot of money. University researchers, who are paid more than minimum wage and have a lot of demands on their time are going to still find it preferable to use SAS because their time is much more valuable in terms of lost income and scarcity.
So …. I’m sure the folks at SAS are nice people and care about education, but it doesn’t hurt them that this generosity is likely to help protect them from encroachment of R in the market.
Why, then, should SAS give away its product to developers, who, if they are any good, are making way more than minimum wage?
Free beer, free speech and free puppies
Plenty of people say that free software should be thought of not as free beer but as free speech, as a moral principle.
Personally, I go more with the theory that R, Linux, Ruby and other open source software are more like a free puppy.
A puppy is awesome if you have the time.
Also, the number of free puppies you are likely to accept is a lot lower than the number of free beers. BUT that puppy you do accept, you’ll come to love. You’ll spend enormous gobs of time with your puppy and get to know it.
Here’s another free puppy thing … having a dog is a social activity. You’ll meet other people who have puppies. As more people learn R (or Linux or Ruby or …) more and more resources become available. Look at Linux – there are the Ubuntu forums where you can find just about everything Ubuntu and a bunch of other sites for every other flavor of Linux. Even a Puppy Linux forum. There are a lot of really good books available. I’d say the documentation and support available for Linux is better than Microsoft’s, from an individual perspective. Of course, a lot of organizations have their in-house Windows support.
Is Linux a threat to any commercial manufacturer? It’s open to debate. There is a good article on arstechnica that said Linux was 1% of users worldwide but over 6% of their readers. There’s a lot of debate on how big the penetration of Linux really is on desktops, which doesn’t interest me that much personally, but if you’re really into it, this Computerworld blog discusses a bunch of studies. On the other hand, there is no question that Linux is a major player in the server market and at least PC World thinks they are on the rise. They aren’t the only ones, but others claim Linux server usage is on the decline.
None of that is my point, actually. I think if you wait until Linux or Google or Microsoft or SPSS or whatever has 92% of the market share you have already screwed up big time. What corporate weenies ought to be getting paid for is to prevent competition from taking root. So, just like with the free educational offering for universities, here are two reasons I think it is in the selfish, vested interest of SAS to give it away free to developers.
1. It supports the people who are giving SAS a competitive advantage.
What is the biggest benefit of SAS over R? Probably ease of use. SAS has an easier learning curve than R, especially if you use SAS Enterprise Guide. SAS also has unbelievably great technical support that you can call. They also have sascommunity.org and SAS-L, both completely volunteer activities, and the user groups – local, regional and national – which SAS supports somewhat but are huge volunteer efforts. There is also the SAS publishing arm which cranks out good books that the authors probably made $3 an hour for if you paid them for their time.
Who are the people who are on SAS-L all the time, writing conference papers, volunteering in user groups, writing books? I’ll bet that a disproportionate number of them are developers. Like giving SAS free to students because they are then “hooked in” to the SAS community, there is an incentive for providing free software to those who are already spending an inordinate amount of time promoting it – not just because they are providing that user base, documentation and community that open source software communities have, at their best. R is growing rapidly in the body of resources available on-line, in print and through user groups. If you are a new person coming into programming, statistics, analytics or whatever the buzzword is we’re calling ourselves today, you have plenty of cool things that are free to which you can devote your time - R, Ruby, Linux - and a bunch more. Of course, of all of those, R has the most direct head to head competition with SAS. The developers are people who don’t mind getting into the nuts and bolts and they are the least likely to be put off by the less user-friendliness of R. Giving it away free is one way to retain those people in your camp.
Equally important in retaining a market advantage for SAS is having a mountain of legacy applications out there. Yes, I know that in a perfect world the better product wins, but on Planet Earth here there is such a thing as fixed costs. I even know a person who is a COBOL programmer, for God’s sake, and it is not because COBOL is superior to C++ or Objective C or Ruby or Perl – or, well, anything. It reminds me of an unimpressive co-worker I once had who, when I asked an extremely honest manager about him replied,
“Well, uh, he’s a nice man and he’s here and uh, he has a job and at one point he probably did good work and he’s been here a long time and it would be too much time and money to get rid of him so, uh, he has a job.”
Having applications that need to run every day, week, month or even year for your annual reports makes it a whole lot less likely that your organization will decide to get rid of a product supporting that application. Personally, when I put in a bid on a contract I usually recommend a client use SAS because it has a lot to recommend it, not the least of which it is fairly intuitive so it is easier for their own personnel to understand and maintain. There’s also the fact that I have used it forever and a day, so I can write programs relatively quickly and my time is billed by the hour. Sometimes the client wants SPSS, Stata or even Excel. Far be it from me to question the people paying the bills, unless they are completely wrong. (“With all due respect, Excel would be a bad choice for structural equation models and doesn’t have a procedure for multiple imputation so, yeah, we won’t be using that.”) If they leave it up to me, though, they get SAS.
2. SAS really isn’t losing sales giving it away for free. Let’s be honest here. I’m not going to buy a SAS license. I am always working for some company or university that has it installed and I get it under their site license as a faculty member, or consultant or employee (sometimes clients want to pay me as a regular employee – as long as the checks are for the same amount and don’t bounce, it’s all the same to me). Apple had a pretty innovative way of giving away its developer tools for free – the cost was $500 but if you bought a new computer the package included a $500 voucher for a new computer. We bought a new computer every year since we got the $500 off, also we like shiny things. SAS is full of smart people. They could probably come up with some innovative way to see that a free license really doesn’t cost them anything.
In short – well, it’s way too late for that now – so, in summary, giving away SAS to developers might win some good will, increase entrenchment and not cost any real money. Kind of like the whole academic thing.
P.S. The puppy, since I always get asked about her, is about six months old in that picture. She now weighs more than me. She is a Dogo Argentino.
Well it appears that you have followed the advice of the narrator from Rocky & Bullwinkle and “tuned in next time” to read Part 2 of my rant on what I am passionate about, which happens to be data, math and cutting through the bullshit.
Hmm … may I respectfully suggest that you consider getting a hobby to fill your spare time, or failing that, a job? Suit yourself ….
As I was saying, much of our social and economic problems, both national and personal, are a result of believing the wrong data, which is related to not understanding how to evaluate data. This is compounded by the fact that some people have a vested interest in promoting false data.
I never lost any money on derivatives because I never invested in them. It never made sense to me that you could bundle up a bunch of mortgages by people who were high risk of not paying their loans and that adding them all together made a good investment. It shouldn’t have made sense to you either.
To give another example, one of the reasons that Bernie Madoff was able to get away with his scheme for so long is that his victims didn’t even know what their financial advisers were investing in. Those advisers apparently did not understand the contradiction between Madoff’s results and all of the data on financial markets.
What about all of those people with graduate degrees who are supposed to be monitoring investments? Didn’t they raise a red flag? According to The Economist, a few people did and their firms advised investors to stay away from Madoff’s fund.
What about the others? Funny thing, that, but a very large percentage of students over the years have told me they didn’t need to learn all of that math because they were never going to use it. I have to agree that it’s pretty logical that you aren’t going to use something you don’t know. Fact is, if you don’t feel very comfortable with the equations and programs that are supposedly telling you that you are getting 10% returns year after year or that batches of high-risk loans can be bundled together to be a giant low-risk investment, then you don’t assume it’s bullshit, you just assume you don’t understand it. You’re afraid to say the emperor has no clothes because you think it might reveal that you’re not as brilliant as you think you’re supposed to be.
I’ll reveal my deep, dark statistician secret. There are times when I read some conclusion, procedure, equation or program and I don’t understand it. Sometimes I have to read it two or three times and still don’t get it. Sometimes I ask the house rocket scientist and he doesn’t get it. So, I download articles, buy a book or two, ask other people I know, ask questions on a forum or mailing list, read another book, try coding it myself, write equations on a piece of paper, scratch those out, swear, and try again. Sometimes the rocket scientist will get interested, too, and the housekeeper will come in the morning to find scraps of paper with equations, all of them wrong, littering the house, and the world’s most spoiled twelve-year-old complaining that she is dying because no one would quit what they were trying to figure out and make her chocolate milk.
If someone claims that the Beijing Multiplier (something I just made up) triples the accuracy of predictions of the birth rate compared to a regular regression equation with the same independent variables, I’m likely to download a set of data on birth rates in 40 countries over the past ten years and check.
I was watching a documentary on the whole Madoff scandal where the interviewer asked one of his early partners, who had made millions,
“Did you ever ask to see the equations he was using, how he was doing this?”
and the man laughed and said,
“No, I wouldn’t have understood it if he’d shown them to me.”
I thought to myself, how could you invest millions of dollars and not know what was going on?
Lately, I have been hearing things that are just as stupid and not many people seem to be questioning them. For example, that the effect of laying off government workers will be to decrease unemployment.
Let me explain this in simple terms. Unemployment is the number of people without jobs who are looking for one. If a bunch of people lose their jobs, then there will be more people without jobs. Those people all bought stuff like cars, houses, Pomeranians, dog food to feed to the Pomeranians, pet-sitting services, car washes and McDonald’s cheeseburgers. If you lay those people off, the immediate effect is that they will buy less stuff and so all the car-, hamburger- and Pomeranian-salespeople will have less money.
There is a theory that by laying those people off, EVENTUALLY deficits will shrink, credit will become more available and businesses will add jobs. I haven’t seen a lot of data to support that theory. In fact, what data I have seen shows that businesses add jobs when they expect to have increased work – sales, production – for those workers and it seems that laying off workers will have the opposite effect.
I’ve seen no data at all to support that it will be reasonable to REDUCE taxes because the numbers of dollars saved by laying people off never add up to the whole deficit, much less a surplus.
Here’s another example – arguments have been made that if we don’t award multi-million dollar bonuses to Wall Street financiers that they will not work as hard to make money and that we won’t get people who are as smart. Somehow this argument ignores that the people investing with Madoff and running banks that received billions in government bailout funds got bonuses anyway. One has to ask how smart they really are, unless smart is defined as being able to get other people to give you money, in which case Madoff and his fellow fund managers are very smart indeed.
I would like to see some data correlating salaries on Wall Street including bonuses with figures like GNP and unemployment. I’d also like to see data correlating tax rates on the highest 1% and 10% of incomes with unemployment rates, GNP and GDP, both in the U.S. and globally.
What good are data? Data give us the raw materials and mathematics gives us the tools to cut through the bullshit that is thrown at us far too often, both intentionally and by others who have accepted and passed on, unquestioning, whatever they were told.
What really gets me is the latter, the people like Madoff’s colleague who said he wouldn’t have understood the equations if he’d seen them.
DON’T ACCEPT NOT KNOWING WHAT’S GOING ON !!!!!
Getting rid of the bullshit – I think that’s something worth being passionate about.
I was at a conference recently when someone asked me what I was passionate about and I answered, “Data”. He seemed very disappointed and wandered off in search of someone more interesting who would give a better answer like curing AIDS or ending world hunger or solving global warming.
I could have added, “and math” but I don’t think that would have improved the situation.
In fact, three issues I care deeply about are poverty, education and inequality, all in equal measure because I think the three are inextricably related and they are all related to – yes, mathematics and data.
If the world ran more efficiently, we could produce more of everything at a lower price, and, other things being equal, people could have a better standard of living on the same income.
If people were better educated, their personal productivity would be higher and their standard of living would be higher, other things being equal. If they were more literate and had a better grasp of data and mathematics, they would question the inequality in the world and ask why it is acceptable in America to give hundreds of millions in tax breaks to people whose companies needed government bailouts to the tune of billions of dollars while cutting benefits of people making $40,000 a year. They would take fewer things on faith and demand proof.
At the same conference, someone gave a figure that there were currently 260 open bids on social media services to government. (These aren’t the exact terms/number; I changed the details because my point is not to call out a specific person.) That figure was live-streaming on the Internet, the statement was made by a person in authority and cited in a couple of other presentations the same day; someone even intended to cite this in a publication on-line.
The only problem is, it was completely wrong, off by a factor of about 100. There were more like two open bids, if any. If you went to that site and searched on “social” and “media” and “government” you would get hits like:
Need temporary file clerk. Must be able to reproduce electronic media like CD-ROMs, have good social skills and obtain clearance to work in government facilities.
How did I know that wasn’t correct? I have some experience with government contracts and that just did not fit with the bids I had seen coming out. So, I went and checked.
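The kind of false hit described above is easy to reproduce. Here is a small Python sketch, assuming (and it is only my assumption, since I don’t know how that site’s search actually works) that the search simply requires every keyword to appear somewhere in the posting. The function names are made up for the example.

```python
import re

def matches_all_keywords(text, keywords):
    """True if every keyword appears as a word anywhere in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return all(keyword.lower() in words for keyword in keywords)

def contains_phrase(text, phrase):
    """True only if the exact phrase appears in the text."""
    return phrase.lower() in text.lower()

posting = (
    "Need temporary file clerk. Must be able to reproduce electronic media "
    "like CD-ROMs, have good social skills and obtain clearance to work in "
    "government facilities."
)

keyword_hit = matches_all_keywords(posting, ["social", "media", "government"])
phrase_hit = contains_phrase(posting, "social media")

print("Counts as a hit on all three keywords:", keyword_hit)
print("Actually mentions 'social media':     ", phrase_hit)
```

A file clerk posting sails through the keyword search and has nothing to do with social media services, which is how two bids can turn into 260.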
This is one reason I am very concerned about the decline of traditional news such as the Los Angeles Times and New York Times. You can call them dead tree media all you want but one thing both have that a lot of bloggers (and don’t even get me started on Facebook and MySpace) don’t is FACT-CHECKERS.
The amount of information on the Internet that is just plain wrong is mind-boggling. Yes, there are plenty of books and radio talk shows as well that are pure fiction masquerading as historical, political, economic and scientific truths.
Why do we accept this? Why don’t we challenge it? Why, in a room full of people with iPads, laptops and smartphones was I the only one who checked this incredibly inflated number? Was it because it just happened to fit with what people wanted to hear?
I don’t know the answer to any of those questions but I do know this. It doesn’t have to be that way. It can start with us. When I started this blog I wrote it just for the hell of it – well, I still do – but then I realized that other people were reading it and might take things seriously if I said that PROC MI was to make up numbers to replace the ones missing (I was kidding and it doesn’t do that. You can read a serious explanation of multiple imputation here).
I now specifically state if I am kidding about something because it is not always obvious to other people. I try to specify when I am stating my opinion, when I am using a hypothetical example to make a point and when I am stating a fact, e.g., this is what the Tukey post hoc test does.
One thing I am not very good about is calling out specific errors, although I do respond and appreciate it when someone else does that to me. In the post above on multiple imputation, a commenter pointed out that it wasn’t a completely random sample. I said he was correct and explained why I did it the way I did for the example.
The more appealing thing to some might be to start running around like a bunch of fifteen-year-olds and calling out every blog and tweet that has a factual error. That’s not a terrible idea and if you actually are fifteen years old or just feel like it, far be it from me to stop you. Since I cannot pass a candy store without going in and buying one of everything, I’m the last person to give lectures on maturity.
Or, you could start with being more careful on your own writing, on what you post, what you tweet. Maybe you could make a major effort to analyze data and publish your results on-line. I’m going with that option because it happens to be more comfortable for me.
Either way, I think we need to quit being so cavalier about,
“Oh yeah, tons of what is on the Internet is bullshit.”
I’m passionate about having data accurate and available for evaluation because I believe that the world would be better off if we had all of the facts, if what we believed to be true actually was.
Much of our social and economic problems, both national and personal, are a result of believing the wrong data, which is related to not understanding how to evaluate data. (Pity the poor person who went out and invested in social media stocks based on the booming market figures.) This is compounded by the fact that some people have a vested interest in promoting false data.
How do we get from wrong numbers, to poverty, poor education and inequality? There’s a reason the title of this post is Part 1, you know!
than the answers.
This was certainly the case at the Tech Coast Angels Fast Pitch Competition at UCLA last Thursday. In this competition, ten finalists are selected to give a 90-second pitch on their start-up. The presentations are rated on investment potential and presentation with members of the TCA holding up cards like a gymnastics or figure-skating competition. Afterward, the raters ask questions of the presenters.
If you are curious about the types of questions investors would ask but you couldn’t fly to Los Angeles, or didn’t buy your ticket before the event sold out, here are some of the questions I found most interesting.
If you are thinking of starting an e-commerce site or app, be prepared to answer these questions:
What is the cohort retention? How many people come back one week, two weeks and thirty days?
How do you get customers? (e.g., do you advertise on sites targeted toward your customers, or offer a free download and then let them pay for extra content – “freemium”?)
What is the cost of customer acquisition? That is, how much money do you spend on marketing for every paying customer?
How are you looking at customer acquisition costs versus modeled lifetime value?
What is the lifetime value of a customer? That is, for each customer you acquire, on average, how much money will you receive from sales to him or her?
Does your e-commerce website include an app? How will you maintain ranking within the app store? How do you keep from being a very popular app that then goes to zero?
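If the retention, acquisition cost and lifetime value questions sound abstract, the arithmetic behind them is not complicated. Below is a minimal Python sketch under my own assumptions: retention at N days is taken to mean "was still active on or after day N" (one of several definitions in use), lifetime value is plain average revenue per customer, and every number and name is hypothetical. It is not how the investors at the event said they calculate anything.

```python
def retention(last_active_day, horizons=(7, 14, 30)):
    """Share of a signup cohort still active on or after each horizon (in days)."""
    cohort_size = len(last_active_day)
    return {
        h: sum(1 for day in last_active_day.values() if day >= h) / cohort_size
        for h in horizons
    }

def customer_acquisition_cost(marketing_spend, paying_customers):
    """Marketing dollars spent per paying customer acquired."""
    return marketing_spend / paying_customers

def lifetime_value(total_revenue, customers):
    """Average revenue received per acquired customer."""
    return total_revenue / customers

# Hypothetical cohort: user -> last day (counted from signup) they were seen.
cohort = {"u1": 0, "u2": 9, "u3": 15, "u4": 45, "u5": 3}
rates = retention(cohort)

cac = customer_acquisition_cost(marketing_spend=5_000, paying_customers=100)
ltv = lifetime_value(total_revenue=15_000, customers=100)
ltv_to_cac = ltv / cac

print("Retention by day:", rates)
print(f"CAC: ${cac:.2f}  LTV: ${ltv:.2f}  LTV/CAC: {ltv_to_cac:.1f}")
```

If the ratio at the end is below one, you are paying more to get each customer than that customer will ever pay you, and that is a question you want to have answered before someone holding a scorecard asks it.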
If you have a better version of an existing product ….
Who are the entrenched competitors and how do you plan to compete with them?
Why would somebody spend to buy your system versus just going with the competitor?
How and why did you choose the price you are asking for your product? Have you tried different price points? Higher? Lower?
How long have you been working on this product? How much does it add to the price of a product?
What specifically does your intellectual property protect?
If you have received previous funding, what has it been used for?
Is this a business to business opportunity or business to consumer?
Do you have answers to these questions? Well, maybe you should.