Because I have written a lot of grant proposals and reviewed a lot of proposals, every time that commercial comes on at 3 a.m. saying,
“The government wants to give you free money!”
I want to do an Elvis impersonation and shoot my TV.
The government does not want to give you money. You have to knock the government to the ground and wrest the money from its grubby hands.
Let me give you a few pieces of advice so that the next time I am reviewing grants I am not tempted to drive over to your house and hit you with a stick for wasting my time.
1. Don’t skip the instructions. Read the instructions. Yes, the instructions are usually over 100 single-spaced pages. Read all of them. With a highlighter. Everywhere it says MUST or REQUIRED or ELIGIBILITY, highlight that. Read the instructions as you write the proposal and then review all of those highlighted parts again after you have finished the proposal.
2. Don’t get creative in your proposal organization. If the instructions say you should have seven parts, for example,
- Literature Review
- Target Population
- Program Design
Please have those seven sections in that order. I read proposals pretty carefully but if I have marked you off for not having a literature review and I find that for some bizarre reason you have included it with program design, then I need to change my scores. I’d like to think if I need to change my scores once or twice I will be unbiased about it, but I can’t guarantee everyone will. Also, it is possible that I will miss something if it is not where I expect it. Keep in mind that I probably have 3,000 pages to read in a week or two on top of other work. If I can give up to 10 points for program design and you have addressed all of the issues, I mark 10 points on the form and skim the rest of the section. If your literature review is in program design and I miss it, oh well.
3. Don’t get cute and try to get around page limits by including seven appendices. You do know that reviewers don’t always read those, don’t you? It is assumed that everything will be contained in the narrative within the page limits.
4. Don’t omit or try to hide information that is relevant to your proposal because it reflects poorly. Fix it! The most common example of this is having a very low amount of time committed to the proposal by certain people. Don’t get cute and leave off the FTE for personnel. That is just going to make me suspicious and I’m going to look in your budget to find it and mark you down for it being too low. If I can’t find it, I’m going to mark you down for not having that information. Same goes with reliability of measures and any other potential red flag for reviewers. If it is not there, I will assume it is bad.
5. Don’t give the names of famous people in the field and then have them in your budget for 1% or 2%. It is nice that famous Dr. Joe works at your institution. You don’t get points for him being in the building.
6. Don’t try to b.s. on the needs section by ignoring covariation. You could say that people with wrinkles are 97% more likely to die in a year than people without, therefore your cosmetic surgery could improve depression incidence and save 27 million lives. Um, people with wrinkles are old (I should know). Give a more accurate estimate of the number of people who are really depressed due to age-related changes in their appearance.
7. If there are points for inclusion of under-represented minorities, don’t just have a statement saying you don’t discriminate. That’s the law, for Christ’s sake! You don’t get points for that. Other things you don’t get points for are not committing a felony, not embezzling, not sexually harassing your employees and not poisoning the Dolphin Pool at the Mandalay Bay.
8. If there are points for inclusion of under-represented minorities on your staff, don’t mention that there are students in your school, patients in your clinic or staff in other divisions of the university from these groups. That’s interesting but we kind of want staff members to be involved who, say, can read and write Braille, if you are working with people who are blind, or who have connections in the African-American community, if you are recruiting a large sample of African-Americans. So, unless those people you mention are actually working on this project, it doesn’t get you points.
9. Don’t propose a research program that requires a large amount of data analysis and have .05 FTE for a statistician. That is two hours a week. I don’t care how tight your budget is. That’s not enough. If you even THINK it might be not insane to propose that, you are required to watch the video in this link three times.
10. Don’t have 40 pages of need, literature review, aims/hypotheses, personnel, program design and then do some vague hand-waving when it comes to analysis. I have actually read proposals where the analysis section said, “We will do descriptive and inferential statistics.” As opposed to what? Crystal ball gazing?
There are more than ten things not to do, but these are a start. I’d rant more but I need to get back to work AND the rocket scientist is hogging all of the Sauvignon Blanc. Something must be done about this.
I’d always found the SPSS help pretty basic, so I don’t even consider it when looking for information. However, courtesy of the UCLA Academic Computing Group, which has a bunch of the SPSS case studies on-line, I found this one on mixed models.
It really is one of the most straight-forward explanations of mixed models, fixed effects, random effects and repeated measures I have found, complete with descriptions of how to use model fit statistics.
Unfortunately, all of the examples come from business. I teach primarily education students and the design in the SPSS examples doesn’t fit most of their studies.
In most of the case studies, the examples of random effects are markets. To remind you, in case you forgot, a random effect is one selected at random from a population of “levels”, which carries with it its own variance. For example, if you select ten media markets at random, and mail people at random one of four types of coupons, then measure sales in the areas, MARKET is a random effect and TYPE OF COUPON is a fixed effect.
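To make the fixed-versus-random distinction concrete, here is a minimal simulation sketch of the coupon example. Every number in it (grand mean, coupon effects, standard deviations) is invented for illustration and none comes from the SPSS case study. It only generates data with the right structure; it does not fit a mixed model.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Fixed effect: we care about these four specific coupon types.
# The effect sizes are hypothetical.
COUPON_EFFECT = {"A": 0.0, "B": 5.0, "C": 10.0, "D": 15.0}

GRAND_MEAN = 100.0  # hypothetical average sales
MARKET_SD = 20.0    # spread of the population of markets (the random effect)
NOISE_SD = 5.0      # residual variation

# Random effect: ten markets drawn from a larger population of markets,
# each carrying its own offset from the grand mean.
market_offset = {f"market_{i}": rng.gauss(0, MARKET_SD) for i in range(1, 11)}

rows = []
for market, offset in market_offset.items():
    for coupon, effect in COUPON_EFFECT.items():
        sales = GRAND_MEAN + effect + offset + rng.gauss(0, NOISE_SD)
        rows.append((market, coupon, sales))

# Averaging over markets shows the coupon differences;
# averaging over coupons shows how much the markets differ from each other.
coupon_means = {c: mean(s for _, cc, s in rows if cc == c) for c in COUPON_EFFECT}
market_means = {m: mean(s for mm, _, s in rows if mm == m) for m in market_offset}
```

If you drew a different ten markets you would get different offsets but the same four coupon types, which is the whole point of calling one random and the other fixed.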
This is modeled fairly easily using SPSS – go to ANALYZE > MIXED MODELS > LINEAR
Pick market as your subject, sales as your dependent, type of coupon as your factor – well, hell, just follow the link above or if you have SPSS installed, under HELP, select CASE STUDIES, and pick the Advanced Statistics option and then Linear mixed models.
That’s not really the type of random effect we have in a lot of educational studies, though. Often what we have is TEACHER or CLASSROOM and unlike market, we don’t give each teacher five different interventions. That would be nice and it is a great way to control for the teacher variance. Similar with the medical study with the cross-over design. Great idea. Great way to control for the clinic or physician variation. You have four clinics and they get randomly assigned to one of four treatments the first week, and then each moves up one the second week (the one with Treatment A goes to Treatment B, the one with B goes to C and so on) At the end of four weeks, each clinic has tried each treatment.
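The rotation in that cross-over design can be sketched in a few lines. The clinic names are placeholders, and I am assuming the last treatment wraps around to the first, which the description implies but does not spell out.

```python
import random

TREATMENTS = ["A", "B", "C", "D"]
CLINICS = ["clinic_1", "clinic_2", "clinic_3", "clinic_4"]  # placeholder names

def crossover_schedule(clinics, treatments, seed=None):
    """Return {clinic: [treatment in week 1, week 2, ...]}.

    Clinics are shuffled so the first-week treatment is assigned at random,
    then each clinic moves up one treatment per week, wrapping at the end.
    Assumes there are as many clinics as treatments.
    """
    rng = random.Random(seed)
    order = list(clinics)
    rng.shuffle(order)
    k = len(treatments)
    return {
        clinic: [treatments[(start + week) % k] for week in range(k)]
        for start, clinic in enumerate(order)
    }

schedule = crossover_schedule(CLINICS, TREATMENTS, seed=0)
for clinic, weeks in sorted(schedule.items()):
    print(clinic, weeks)
```

Each clinic ends up with every treatment once, and in any given week each treatment is in use at exactly one clinic, so clinic differences are balanced across treatments.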
That’s not how it works in schools, though. Teachers aren’t repeated.
In teaching my data analysis class this fall, I want to start with simpler models and work to more complex ones. I’ve plenty of plain vanilla general linear model examples and some very complex generalized mixed models and nothing in between. The SPSS ones would be perfect if they were relevant to education.
I have other work to do and have spent all of the time I can spare on mixed models tonight. Maybe this weekend I can continue my search for good examples of mixed models in education. If you have any good suggestions, hit me up.
“I’m sure you weren’t expecting this …”
… it’s usually an even worse sign.
In this case, however, he had called to say that The Spoiled One was being recommended for a scholarship to a boarding school. We had never considered boarding school for five seconds. She’d already been admitted to a private, Catholic girls school with a stellar academic reputation. It was the school we had picked out for her two years ago.
She’d done the campus visit, studied for her high school placement test, gone for her interview and gotten admitted. We’d even mailed in the registration fee. It was all set.
So … I took a deep breath and said,
“We will go visit. That is as far as I am willing to commit.”
When I told the rocket scientist about this, he looked as if he had been kicked in the stomach. He had waited 42 years to have a child and wasn’t at all ready to have her living somewhere else.
Well, we did go visit. The school is in a drop-dead gorgeous setting. The student-teacher ratio is 10 to 1. The average SAT scores are 200 points above the national average. But still, she would be moving into a dorm, at fourteen years old.
Impatient with all of the agonizing and discussing, The Spoiled One stamped her feet for attention and said,
“I think you people are under-estimating me.”
It’s funny how many people now when we tell them that we are sending our child to boarding school respond with a long pause, unsure what to say. Some have even gone as far as to ask if anything is wrong. We’re not sending her to rehab, for Christ’s sake!
In fact, one reason we were willing to entertain the idea of her living away from home five days a week is that she hasn’t ever gotten into any real trouble. Right now, she has a 4.0 GPA for the semester, she’s made honor roll every year, been on the student council for three years, plays soccer four or five times a week.
Boarding school is not the typical choice in this country and she won’t be going to school with a lot of typical kids. In fact, there are kids from over twenty countries living in the dorms.
The more we talked about it and visited, the more we could see her fitting in there. It’s interesting, when I read posts from other west coast parents they all say the same thing – they never considered boarding school, but when their child brought it up and they looked into it, it seemed like the right choice.
So, no, there is nothing “wrong”.
Are we nervous? Of course we are nervous! I still can’t believe it wasn’t just last week that she was eight years old making her First Communion.
I have tried very hard never to hold my children back in life. I have taken a deep breath and put my 17-year-old daughter on a plane to fly across the country and enroll in New York University, which I had never even seen, but which sucked up half of my income for 3 1/2 years.
I took another deep breath and sent my 16-year-old daughter to Boston to live with the best judo coach in the country, so she could train for the Olympics.
I drove my nineteen-year-old daughter to San Francisco to start her junior year of college at San Francisco State. Two years after she graduated, I drove her back to LA to start her masters program at USC.
So, yes, I’m used to taking a deep breath and letting go.
As the saying goes,
“There are two things we must give our children. One is roots, and the other is wings.”
Want to start a business? Think you’re a good programmer? Have a good design? (By a good design, I don’t mean an idea, I mean pages of scene, character and action description if it’s a game, plus – well, let’s just say one HELL of a lot more than “there should be an app for that.” But that’s a different post.)
Do all of your friends tell you that you should start your own business?
I have a few friends who also own businesses and what we all have in common is the experience of having to cover the payroll with checks written on our credit cards.
The most I have ever been in debt is $87,000. At the time, we had work coming in, but our accounts receivable was increasing, not our bank balance. We had a couple of federal grants, a few commercial clients and everyone was either late paying or waiting for funds to be signed off by Agency X so they could be released to System Y and then drawn down into our company account.
You know those stories you read about people who maxed out their credit cards to start a business? Well, we were close – but then we paid it down to zero.
As a result, I added it up today and, between the two of us, the rocket scientist and I have enough credit to buy a couple of houses in Florida (or a quarter of a house in Santa Monica!)
More recently, we started a project where we are still waiting for the release of funds, which means that the newly retired rocket scientist has been working twenty hours a week, for months, for no pay, while I’ve been working another twenty hours a week, at no pay, while doing work for actual money on top of that. In short, we have had the equivalent of a full-time very experienced programmer on this project for months. We also had to pay an accountant to do some of the financial stuff required on the contract.
We already had in place all the hardware we needed – one computer for each of us, plus an older computer for testing compatibility, a laptop for when I’m traveling and a printer/scanner. We have plenty of office supplies – paper, printer cartridges.
There were a couple of trips out of state required, to meet with key people, for research on site, that cost a few thousand dollars.
In short, if you’re thinking of starting a business, you’d better be able to get by on a very reduced salary, for months. You need to have equipment, supplies and money for other expenses specific to your business. Since we sell consulting services, we don’t have materials or inventory costs, but we do have a lot of travel costs to get from Point A to Point B where we are consulting, and hours to prepare whatever we are presenting. We often don’t get paid until we have come back from Point B.
You’ll probably have long periods where the payments from Project X three months ago are covering your bills while you do work for Project Y now. But not always.
No, this isn’t a call for more small business loans or a complaint that businesses can’t get started because of a lack of credit. In fact, I’ve had an account at one bank for 22 years and at another for 10 years. Since my credit line balance has been up, and then down to zero, several times over the years, they have kept increasing it.
I don’t blame the banks at all for being reluctant to loan money to new businesses. Why should they take all of the risks while you continue to draw a salary and could walk away unscathed if your product fails?
No, this is just a heads up on what to expect to people who are planning to start a business – and to people coming in late in the game to businesses like ours expecting to be given a share of the company in addition to their salary.
If you really want to start a business, be prepared to have some skin in the game.
There are some things to like about Statistica. The scatter plot matrix, for one. I’d done a sentiment analysis of a data set on blog posts (not mine). For each post, I had three variables
- number of negative sentiments expressed in the post,
- number of positive sentiments expressed in the post
- total number of comments that poster had made, ranging from 1 to over 1,000.
I thought people who comment a lot would be the ones who had the most negative comments, whereas there would not be as much of a correlation between positive comments and frequency.
I like the graphic output you get, which shows a frequency distribution for each variable and a plot for each pair. All at once you can get a sense of the strength of the correlation, whether it might be affected by restriction of range – as shown by a skewed distribution – or by outliers.
There seems to be an actual correlation between the number of positive comments and the number of negative comments. Also, positive comments outnumber negative comments almost three to one.
One might be tempted at this point to run out and say,
“Oh, look! Sentiment is very positive!”
Also, it appears that people who have more negative comments also have more positive comments, this means that ….
Just stop right there.
Before saying this means anything, you should go back and take a look at the comments being categorized as positive or negative. The first thing you will note is that computers are very poor at detecting sarcasm, subject changes and idioms. The data came from comments on blogs related to Apple computer products. Here are just a few of the cases where I disagreed with the computer.
- “Yeah, tell us you’ll improve conditions at your manufacturing plant in China. That would be great, wouldn’t it?” (Includes “improving” and “great” so counted as two positive sentiments).
- “I’d rather not say nice try, but … ” (Counts as one positive comment, with the word, “nice”)
- “Buy Windows! It’s superior” (Counts as one positive comment, with the word, “superior”)
- “Too bad I can’t buy it right now.” (Counts as negative, with the word “bad”)
I’m not saying that Statistica is bad - I don’t think it is – or that text mining is useless – I don’t think that, either.
What I DO think is that text mining has to be an iterative process. First, you get your results and then you examine them, make some changes – in this case I would start with the synonyms data – and you re-run your analysis.
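The misclassified comments above are easy to reproduce with a naive keyword counter. This is only a sketch of the general idea, not how Statistica scores text, and the word lists are tiny hand-made ones chosen to match the examples.

```python
import re

# Hypothetical word lists, just enough to reproduce the examples.
POSITIVE = {"improve", "great", "nice", "superior"}
NEGATIVE = {"bad"}

def naive_sentiment(comment):
    """Count positive and negative keywords, ignoring all context."""
    words = re.findall(r"[a-z']+", comment.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos, neg

comments = [
    "Yeah, tell us you'll improve conditions at your manufacturing plant "
    "in China. That would be great, wouldn't it?",
    "I'd rather not say nice try, but ...",
    "Buy Windows! It's superior",
    "Too bad I can't buy it right now.",
]

for c in comments:
    print(naive_sentiment(c), c)
```

The sarcastic first comment scores two positives and the wistful last one scores a negative, which is exactly why you have to read a sample of the categorized comments, adjust the word lists and re-run.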
Off to bed. I have to be up in six hours and head to the Black Belt Magazine studios for a photo shoot on our new book that is coming out this fall, Winning on the Ground: Championship matwork for judo, grappling and mixed martial arts.
It’s a bit of a leap from text mining, but, variety IS the spice of life.
The advantage of learning a new language is it sometimes makes you re-think the old languages you know. For example, here is a problem that happens often:
Some people are morons.
For example, say I were to ask you the following question:
“How old are you?”
YOU would probably answer something like, 42 or 21. You didn’t mistake that for an essay question, now, did you? That, my dear reader, is because YOU are not a moron. However, trust me when I tell you that other people are not as smart as you.
A rather annoying percentage of people enter responses along the lines of:
I am 47 years old.
I just turned 21. Happy Birthday to me.
87 (yes, eighty-seven)
and so on ….
Just using the sub-string function to read in the first two characters won’t work, obviously.
var age = prompt("How old are you?");
var ageyears = age.replace(/\D/g, ""); // strip every non-digit character
Usually, in my SAS programs, I would either just define age as a numeric variable and all of those who included text had their values set to missing. Or, if I wanted to minimize missing data, I would write a statement to just read in the first two characters, or maybe to strip out “years” and “yrs”. However, in the latest data set I have, it seems to be a sample of people who are creatively annoying, so I had to settle for a lot of missing data or do something else. I got to thinking that there MUST be some function in SAS that does something similar.
Well, wouldn’t you know ….
Age_numonly = compress(age,'0123456789','K');
Having the ‘K’ at the end reverses what the COMPRESS function normally does and instead of deleting your numbers it keeps them. I don’t know how I did not know this. Maybe I knew it at one point and forgot it? Be sure you have the ‘K’ in quotes, by the way.
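For comparison, here is the same keep-only-the-digits idea as a Python sketch, run against the kinds of answers listed earlier. The function name is mine, and the last answer is a made-up case showing where the trick breaks down.

```python
import re

def digits_only(answer):
    """Keep only the digit characters and drop everything else."""
    return re.sub(r"\D", "", answer)

answers = [
    "I am 47 years old.",
    "I just turned 21. Happy Birthday to me.",
    "87 (yes, eighty-seven)",
    "42",
]
ages = [digits_only(a) for a in answers]
print(ages)  # ['47', '21', '87', '42']

# Caveat: every digit is kept, so a second number gets glued onto the first.
print(digits_only("I turn 30 in 2 weeks"))  # '302'
```

So it rescues most of the creatively annoying answers, but it is still worth checking the results for implausible ages before trusting them.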
Well, now I have it stored in my blog, which is better than having it in memory, because unlike my memory, this blog gets backed up regularly.
Business is good and has been getting better all year for me. My friend Jake is doing great. He was an anesthesiologist and several years ago became board-certified in geriatrics. He loves working with older people and his patients love him.
When I travel around the country, though, or catch up with old friends, I find that is not true everywhere. Others have been unemployed either continuously, or on and off, for longer than the 99 weeks of unemployment benefits.
“People I know” is hardly a random sample, despite what your average sophomore seems to think, but I still thought it would be interesting to look at the people I’ve known for a decade or more and see where our paths diverged. Because these were all people I have known for 10, 20, or 30 years, we all were at the same place at one point. So, what happened?
I’ll tell you what DIDN’T happen. No one who I know that is long-term unemployed or under-employed is lazy. These are people who have worked construction, cleaned houses, loaded trucks, worked in factories and put in twelve-hour days as middle managers. Also, as you can guess by that list, they also are people who don’t consider manual labor “beneath them”.
None of these people are stupid. Some speak two languages. Some have two or three years of college.
Here are three things that did happen, though. One is that they just got old and for those who had spent a lifetime doing physical labor, they just could not do it any more. Their knees, shoulders, hips, hands, back – you name it – gave out and they were unable to do physical work. These folks didn’t have the skills to do a desk job. Even taking a six-month training course on how to use a computer, word-processing, spreadsheets and social media only got them up to where my eighth-grade daughter is already. There aren’t a lot of positions for people with eighth-grade level computer skills.
A second thing that happened was they settled down. They had families. They married. They bought houses. When they lost their jobs, they had a husband or wife who still had a job. They had a mortgage to pay. When the factory closed, they couldn’t just leave town and let the husband/wife watch the kids, work and pay the bills while they went somewhere else and got another apartment and a new job.
Here is the big, big difference between the two groups of people, though- the people who are unemployed quit their education. They got comfortable as a COBOL programmer, teamster, regional manager. When that job was gone, it turned out there was not a real demand for the person who knew more about the blueprint archive at General Dynamics than anyone else in the world.
Why I Won’t Be Unemployed Five Years from Now
I needed to capture some audio so I downloaded audio hijack and invested 10 minutes in learning to use that. I also needed a voice over so I fiddled with Garageband for half an hour. I’d used that before but not lately. Every time I use it, it takes me less time to remember, “Oh yeah, that’s how you do that again.”
I needed to output some mp3 files as ogg files. I don’t even remember why I had audacity in my applications folder. I don’t *think* it came with my new computer. To export as an mp3 file I needed lame, which I also needed to download.
Most of my career has been spent processing structured data, specifically doing what is now sneeringly called “frequentist” statistics. Looking to update my syllabus for next year, I’ve been looking into data mining, both with SAS Enterprise Miner and Statistica. I was able to download the trial version of Statistica and the On-Demand version of SAS Enterprise Miner.
So …. this is what I have been up to in the last month. Some of those things will not pan out. At one point, I was pretty good with Tel-a-Graf (graphic design software for plotters – yes, plotters), Foresight – another programming language which I don’t think is around any more. I used Lotus Notes and learned SAS FSP (for “Full Screen Product”). I’ve used a VAX, IBM, DEC, Franklin Ace, Lisa and Next computer.
I think I have identified the dividing line between those whose careers stayed on an upward trend all of these years and those whose careers did not.
Or, to be specific, it’s the idea of Johnny Appleseed. The way I see it, each new thing I learned is like scattering some appleseeds. Most of them will probably get eaten by birds, fall on rocks or be bought by Microsoft and killed off. If you toss enough seeds around, though, some of them will bear fruit and twenty years later, you’ll have people lined up to pay you for your knowledge of apples.
So I say,
“There may be some useful information in the text fields. For example, people who use credit to buy commodities, such as meat may differ from people who buy finished goods like clothing who may differ from people who purchase machinery on credit. Perhaps you may want to consider some kind of clustering.”
The very bright young people nod and one says quite brightly to another,
“Well, you better get out your grep statements.”
All of the rest nod in pleased agreement, while I say to my old and faded self,
Grep? That’s what you’ve got? Seriously? What. The. Fuck.
Okay, ten points for the bright young people for knowing any Unix commands or even that Unix exists, which puts them ahead of a lot of people. However, please explain to me why in the name of God you would not even consider using something like Statistica or SAS Enterprise Miner? There is an R text mining package. I have never used it because I don’t use R (long discussion of that here) but these young people had spent three semesters learning to program R and did not even know it existed.
At one point I was playing around with both Ruby and SAS to write a program to parse text. Do you have any idea how much of a pain in the ass that is? In that case, because it was on a set of data with a VERY limited scope, we could do it by using just a few hundred words. It was a small project with a very small budget and at the time I was wanting a project that gave me an excuse to learn Ruby.
For a more general project with the whole English language as its scope, that would be an insane undertaking. It would cost the client several times the cost of buying a SAS license or Statistica (not sure about the SPSS offering) – and what I could write would not be within shouting distance of as well done and comprehensive as something a team of people had worked on for years.
The most recent client who asked me this actually has a SAS Enterprise Miner license at their organization! (So, yes, while the license fee is humongous, since it had already been paid, the additional cost to use it on this project would be zero.)
I started on it and after about ten minutes of reflection realized there were probably dozens of jQuery plug-ins that did slide shows of every size, shape and form. Sure enough, 5 seconds on Google gave me a couple to download.
I downloaded one, modified it a bit and it was okay, though I’m not sure it is exactly what I want either. When I looked at the code in detail, it was evident the author had done the same as me, downloaded someone else’s code and modified it, because there were entire directories in there that did nothing. So, I deleted those.
After playing with that for a while, I thought perhaps there were other, better, slideshow plug-ins available. I downloaded another one because, even though I knew it probably wouldn’t suit my purpose, it was written so much more succinctly, I found it interesting.
So …. two lessons
- Don’t waste your time creating something that has already been created.
- Even if you do want to create something, either just for the hell of it or as a learning experience, you’ll probably learn a lot and end up with a better product, faster, if you build on what other people have already done rather than start with a blank page and Notepad++
No, this is not a post about politics or life-hacking, although the same title could apply in either case. I am talking about statistical power. People often ask me what the power of a test is, but the problem is that they are asking the wrong question. Power is not a single number. I understand where the confusion can occur.
What is power & how do you get it?
There are two errors people worry about, Type I and Type II. The probability of making a Type I error is set and it is called alpha (α). Alpha is usually set at .05. It is the probability of rejecting a true null hypothesis. Now, what is a null hypothesis? It is a hypothesis of ZERO difference between the means, ZERO relationship between X and Y. A Type I error can occur in the case of ONE number, zero. If the effect is zero and you say it isn’t, you have made a Type I error.
A Type II error is accepting a false null hypothesis. The probability of a Type II error is called beta (β). A Type II error can occur in an infinite number of cases, for any number other than zero. If the effect isn’t zero and you say it is, you have made a Type II error. Power = 1 – β. Depending on what the actual value of your statistic is, the power will be different.
Look at it logically. If in an infinite population your experimental group is a million times better than your control group, then, just logically, the probability of you pulling two samples and incorrectly deciding there was no difference is very low. Similarly, if your experimental group performs .01% better than your control group, although the difference is not zero, you can logically conclude that a good percentage of the time you might conclude that there is zero difference, which is, incorrect statistically, although perhaps not for practical purposes.
Dr. Park, at Indiana University, has a very nice explanation of hypothesis testing and power analysis. He says, assume that we are testing the hypothesis that the mean is 4 when in actuality the mean is 7.
Let’s just say we are hypothesizing that people feed their office guinea pigs hay an average of four times a month. (I had to do something with the office guinea pigs to make them feel part of the team, so I put them in here.)
This variable is normally distributed with a standard deviation of 1. (Just a reminder that the standard deviation OF THE MEAN is the standard error.)
The cut-off for rejecting our hypothesis is 5.96 because, computing a z-score, we get (5.96 – 4) / 1 = 1.96, and at 1.96, p is not less than .05; p = .05. So, 5.96 is the highest number at which we accept the hypothesis that the mean = 4.
This hypothesis is, in fact, wrong. People really feed their office guinea pigs hay a mean of 7 times per month.
We know this because God told us so in a spare moment when he was not busy telling Republicans they needed to become candidates for president.
Given that the true mean is 7, we can compute the z-score for 5.96:
(5.96 – 7) / 1 = -1.04
We look up 1.04 in a z table because, although we have a direct line to God, we don’t have a calculator with statistical functions, and we find that about 15% of the time we’ll get a value of 1.04 or greater. (14.92% of the time, actually, if you’re a precision freak).
So, this tells us IF we hypothesize the mean is 4 but it really and truly honest-to-God is 7, and IF the standard error is 1, then our power is .85, because only 15% of the time will our sample mean fall below the 5.96 cut-off (a z of -1.04 or lower) and lead us to wrongly accept the hypothesis.
So, our power is .85, right?
Well, not so right. It is – IF the standard error is 1 and IF the “true” value is 7 and IF we were doing a z-test. But if we knew the true value, what was the point of doing any tests?
What if the true value is 6? Then z = (5.96 – 6) / 1 = -.04. The percentage of the z-distribution (which is normal) that is greater than -.04 is about 50%, so our power is around .50.
Important point number one – power depends on the true value, and you don’t know the true value
This is the first important point to keep in mind …. the power of a test is different based on what the true population value is. But you don’t really know what it is, since God is too busy worrying if people are having gay sex or eating pork to talk to you about guinea pig cuisine.
Generally what people do (if “what people do” means what I do), is enter a number of possible values into software like PROC POWER. So, I enter 6, 6.5, 6.75 and 7 and find that the values for power are .51, .70, .78 and .85. I can say that the power of the test is at least .85 if the true mean number of office guinea pig hay purchases is 7 per month or higher. That is, when the true figure is at least 7 we would reject the false null hypothesis at least 85% of the time. If it was a lot more than 7, we’d reject it a lot more than 85%.
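If you don’t have PROC POWER handy, the same z-test arithmetic can be sketched with the Python standard library. The function name and its defaults (null mean of 4, standard error of 1, critical z of 1.96) are my own choices to match the guinea pig example; this is the hand calculation above, not a reproduction of what PROC POWER does internally.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def z_test_power(true_mean, null_mean=4.0, std_error=1.0, z_crit=1.96):
    """Power of a two-tailed z-test of H0: mean = null_mean."""
    upper_cut = null_mean + z_crit * std_error  # 5.96 in the example
    lower_cut = null_mean - z_crit * std_error  # 2.04 in the example
    # Chance the sample mean lands in either rejection region when the
    # sampling distribution is really centered on true_mean.
    p_upper = 1 - Z.cdf((upper_cut - true_mean) / std_error)
    p_lower = Z.cdf((lower_cut - true_mean) / std_error)
    return p_upper + p_lower

for mu in (6, 6.5, 6.75, 7):
    print(mu, round(z_test_power(mu), 3))
```

Rounded rather than truncated, the results come out at roughly .52, .71, .79 and .85. The lower rejection region contributes almost nothing here, which is why the one-tail hand calculation gives the same answers.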
Important point number two – power depends on the variability
In the example above, I forced the standard error to equal one by assuming my standard deviation was 10 and my sample size was 100. That isn’t very realistic, but I was just going with the example in his paper. Let’s say instead that the standard deviation is 1, which is more reasonable, and the sample size is 10. Then my standard error is about .32 (1 divided by the square root of 10) and the power is going to be greater than .999.
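Here is a rough Python sketch of how much the variability matters. It assumes the standard error of a mean is the standard deviation divided by the square root of the sample size, and it counts only the upper rejection region, with a null mean of 4 and a true mean of 7 as in the example. Both function names are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def standard_error(sd, n):
    """Standard error of a sample mean."""
    return sd / sqrt(n)


def upper_tail_power(true_mean, null_mean, se, z_crit=1.96):
    """Chance the sample mean lands above the upper cutoff."""
    cutoff = null_mean + z_crit * se
    return 1 - NormalDist(mu=true_mean, sigma=se).cdf(cutoff)


for sd, n in ((10, 100), (1, 10)):
    se = standard_error(sd, n)
    print(sd, n, round(se, 3), round(upper_tail_power(7, 4, se), 4))
```

With a standard deviation of 1 and ten observations, the cutoff sits so far below the true mean of 7 that the test rejects essentially every time.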
Important point number three – power depends on the test statistic
A z-test is a test where you compare the mean to a constant value. Generally, you don’t have a constant value. More likely, you have two groups. Say, we want to know if office guinea pigs get hay as often as home guinea pigs. My hypothesis is that the office guinea pigs will get it seven times a month, because they need more energy to keep up with their official duties, while the home guinea pigs will only get hay four times a month. I select a total sample of 10 with only 5 in each group, because I want an equal number and for some reason it is difficult to locate people who have office guinea pigs. The standard deviation of number of times of hay per month is still 1. When I compute the power of this test, it is .985.
SO …. even if you go with the standard .05 level of significance (level of significance ALSO affects power) and the standard two-tailed tests (whether you have a one or two-tailed test ALSO affects power) and you don’t have to bother about correlations between groups (the correlation between groups in a paired t-test ALSO affects power) you STILL can have a whole bunch of numbers that MAY be the power of the test depending on what the test statistic, variability and hypothesized value are.
The one thing that affects power people usually ask about is sample size
Yes, sample size also affects the power of a test. So, if I only had 4 guinea pigs per group, my power would be .939. If I had 10 guinea pigs in each group, it would be above .999.
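One way to sanity-check numbers like these without PROC POWER is brute force: simulate a lot of guinea pig samples and count how often a two-tailed independent t-test rejects. The sketch below assumes normally distributed data, true means of 7 and 4, a standard deviation of 1 and a .05 significance level. The critical values are copied from a standard t table, and the function name and seed are my own choices. Simulated power wobbles a bit from run to run, so expect values near the ones above, not identical to them.

```python
import random
from math import sqrt
from statistics import fmean, variance

# Two-tailed .05 critical values of t from a standard table, keyed by degrees of freedom
T_CRIT = {6: 2.447, 8: 2.306, 18: 2.101}


def simulated_t_power(n_per_group, mean_a=7.0, mean_b=4.0, sd=1.0,
                      n_sims=4000, seed=1):
    """Share of simulated samples in which a pooled t-test rejects H0."""
    rng = random.Random(seed)
    t_crit = T_CRIT[2 * n_per_group - 2]
    rejections = 0
    for _ in range(n_sims):
        a = [rng.gauss(mean_a, sd) for _ in range(n_per_group)]
        b = [rng.gauss(mean_b, sd) for _ in range(n_per_group)]
        pooled_var = (variance(a) + variance(b)) / 2   # valid for equal group sizes
        t = (fmean(a) - fmean(b)) / sqrt(pooled_var * 2 / n_per_group)
        if abs(t) > t_crit:
            rejections += 1
    return rejections / n_sims


for n in (4, 5, 10):
    print(n, simulated_t_power(n))
```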
However, if you ask me to tell you how many people you need in your sample to have a power of .80, you’re asking the wrong question. The answer depends on how large an effect size you expect (in these examples, the difference between means), how much variability there is, the specific statistical test you are doing and other factors like whether it is a one- or two-tailed test and the correlations between your groups.
The best answer you are going to get from me is that if you have 128 people total in your sample you will have power of AT LEAST .80 IF you are doing an independent t-test, IF there is AT LEAST a half-standard deviation difference between the two groups, AND you are doing a two-tailed test AND your null hypothesis is that there is zero difference between the groups. However, if there is a smaller difference than that, power will be less. Also, if you are doing a different test, say, a logistic regression, power calculation is more complicated.
But I know that you are going to nod knowingly, turn around and walk out the door saying,
“128. Got it. Thanks!”
In the past, when I had to do any type of parsing of text, I wrote my own code with a zillion SUBSTR functions and IF statements and it did the job, but it was so-o-o ugly and painful that I never even considered including text mining in any courses I taught.
I looked into SAS Enterprise Miner years ago but the commercial version costs (and this is approximate) $1,278,544,899,711,315 and your left kidney.
The SAS On-Demand version sucked. You know how some programs you can get a cup of coffee while waiting for them to run? With the original SAS On-Demand for Enterprise Miner you could fly to Colombia, work as a day laborer to earn the money to buy land, start your own plantation, breed a strain of genetically superior coffee beans and skip the country on the last plane out just before the latest government coup nationalizes your business – and your results STILL wouldn’t be available when you got back.
Having had such good luck with SAS On-Demand for Enterprise Guide last semester, I thought I’d give Enterprise Miner another look.
Last year, The Spoiled One was in the living room with her boring parents, complaining they were watching The Daily Show with boring news when it turned out that Justin Bieber was the guest.
She must have felt like this.
The latest version is unbelievably faster. I cannot tell you if it is better, because it ran so slowly in the past that it was impossible to tell. It is easy to use. Let me give an example.
First, you register with SAS On-Demand and register a course for use with Enterprise Miner. This is really easy.
Second, you start Enterprise Miner, which requires nothing more than clicking on the Get Software link on your login page.
Next, create a project. Just go to FILE > NEW > Project and click next a lot. Along the way you give it a name. It’s pretty obvious.
It may not be obvious that you need to have a data source available and create a diagram. Again, it’s pretty easy to figure out, though.
Creating a data source – go to FILE > NEW > DATA SOURCE
A window pops up and the default is SAS TABLE, which is what you want if your data is in a SAS dataset (they now call them tables. I blame the damn SQL people.). Click Next.
In the next window, you browse to where your data are. Because I am just testing this for use in a class, I used the abstract data set in the Sampsio library.
So, you have a project, a blank diagram and a data source. Now what?
1. Drag the icon under data sources on to your diagram
2. Click on the Text Mining Tab
3. Click on the Text Parsing tab (hovering over each tab with the mouse will give you its name) and drag it to the diagram
4. Click on the little grey stem sticking out of the end of your data source and drag it to the text parsing box.
5. Now, right-click on the Text Parsing box and from the drop-down menu, select RUN
After a bit, it will come up with a window that has two choices, OK and Results. Click on Results. The most interesting bit in the results, I think, is the table of frequency for each word. You can see which words are most common in your documents.
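If you are curious what a word frequency table boils down to, here is a bare-bones Python sketch. It is not what Enterprise Miner does internally, which I can’t see from the outside. The three documents are made-up stand-ins, not the Sampsio abstracts, and term_frequencies is a name I invented.

```python
import re
from collections import Counter


def term_frequencies(documents):
    """Count how often each lowercased word appears across all documents."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


# Made-up example documents, not the Sampsio abstract data
docs = [
    "SAS makes text mining easy",
    "Text mining of survey data with SAS",
    "Mining the data for guinea pig hay purchases",
]

for word, n in term_frequencies(docs).most_common(4):
    print(word, n)
```

Even on three toy sentences you can see the problem the next section deals with: little words like “the” and “of” get counted right along with the interesting ones.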
STOP WORDS AND OTHER OPTIONS
This is just the beginning, of course. As you can imagine, if you had to actually write a program to read every word separately, that would take a bit of time. It would take far more time to have it ignore words that are useless, like “the”, “that”, “there”. These are called stop words. Enterprise Miner has a stop list and you can add or delete words from it.
Click on the thing that looks like a page to add a row and type in another stop word. For example, these abstracts come from the SAS Global Forum proceedings, so they probably all have some words like data and SAS that occur in every one of them. In this case, those words are pretty useless as far as analyzing the documents goes. You can add those to your stop list.
If there is a word you want to keep, you can remove it from the stop list by selecting it and clicking that X at the top (right next to the thing that looks like a new page). You’ll be asked if you are sure you want to delete that row.
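For the curious, the idea behind a stop list fits in a few lines of Python. This is only a sketch of the concept under my own assumptions. The starter list, the sample sentence and the keep_terms name are all made up, and Enterprise Miner’s actual stop list is a table you edit through the window described above.

```python
import re
from collections import Counter

# Illustrative starter list, not the one that ships with Enterprise Miner
stop_list = {"the", "that", "there", "of", "for", "with"}
stop_list |= {"sas", "data"}     # add words that show up in every abstract
stop_list.discard("there")       # remove a word you decide you want to keep


def keep_terms(text, stop_words):
    """Split text into lowercase words and drop anything on the stop list."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in stop_words]


sample = "There is SAS data in the abstract that describes text mining"
print(Counter(keep_terms(sample, stop_list)))
```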
How do you get to the stop list, you may ask, quivering with excitement.
Among the options you can set are:
the language to use,
a list of multi-word terms, everything from “a lot” to “keep in mind” to “zero in”,
parts of speech to ignore, like adjectives, and, of course,
the stop list.
To modify any of these, just click on the three dots next to it and a window will pop up, like the one shown above for the stop list.
If you haven’t actually had to analyze text data before, you have NO IDEA how amazingly awesomely cool this all is.
When I was in graduate school, we would actually print out multiple copies of the documents, cut the pages into paragraphs and sort them into categories.
More recently, this is why I started using Ruby: it made parsing text much easier than SAS did. There were some cheaper and open source solutions that I looked at, but their documentation was non-existent and their interfaces were clear as mud.
The Bad and Good News
Speaking of unclear interfaces … I’m not sure I would have guessed that the page with the corner folded meant “add new row”. Also, there is a LOT of stuff on the Enterprise Miner screen. You have all of these different panes in the window and the options in them are completely different depending on whether you have clicked on the text mining tab, the text parsing box or something else. I’ve read a couple of data mining books, one specifically on Enterprise Miner, and they still were very sparse, particularly in their treatment of text mining, which is what I was most interested in.
That’s the bad news. The good news is that when I was at SAS Global Forum, I picked up a copy of Practical Text Mining. I almost didn’t buy it because it’s over 1,000 pages and my suitcase was already pretty full, which meant I’d have to lug it through the airport. Even worse, it did not have an electronic version, which is tough for me because even with contacts and glasses worn OVER my contacts, I still have difficulty reading some of the screen shots in it. (I expect if I had normal eyesight, I’d be fine.)
All that being said, this book is really useful. I know I got a discount at the conference, but still, it was about $70, which for a textbook like this is super-cheap. A thousand pages sounds like a lot, but that’s because it starts with the very basics and is a bit redundant. That’s not so terrible, though, because that makes it easy to read. I was lying in bed sick this morning and read the first 120 pages in about two hours.
This is a godsend to anyone doing a qualitative dissertation. The real tragedy is that a lot of people in areas that do qualitative research – education, psychology, nursing, social work, to name a few – probably won’t even be aware that Enterprise Miner exists, much less that they can get it for free to use in teaching their courses.
Seriously, people, this is a huge opportunity for you to teach your students about text mining and it’s really not that hard.