statistics

Text Mining with Statistica (or anything else) – look again!

By AnnMaria De Mars, May 16, 2012

There are some things to like about Statistica. The scatter plot matrix, for one. I’d done a sentiment analysis of a data set of blog posts (not mine). For each post, I had three variables:

number of negative sentiments expressed in the post,
number of positive sentiments expressed in the post,
total number of comments that poster had made, ranging from 1 to over 1,000.

I thought people who comment a lot would be the ones who had the most negative comments, whereas there would not be as much of a correlation between positive comments and comment frequency.

I like the graphic output you get, which shows a frequency distribution for each variable and a plot for each pair. All at once you can get a sense of the strength of the correlation, whether it might be affected by restriction of range – as shown by a skewed distribution – or by outliers.
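To see why it pays to look at the plots and not just the coefficient, here is a minimal Python sketch of how a single outlier can manufacture a correlation. The counts are made up for illustration (they are not the blog data), and `pearson_r` is a helper written for this example, not anything from Statistica.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up sentiment counts for six posters; the last is a very heavy commenter.
positive = [3, 1, 4, 2, 5, 30]
negative = [1, 2, 1, 2, 1, 12]

r_all = pearson_r(positive, negative)
r_without_outlier = pearson_r(positive[:-1], negative[:-1])

print(f"all posters:     r = {r_all:.2f}")
print(f"without outlier: r = {r_without_outlier:.2f}")
```

With these invented numbers, the correlation is strongly positive when the heavy commenter is included and negative once that one poster is dropped. That is exactly the kind of thing a scatter plot shows at a glance.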

There seems to be an actual correlation between the number of positive comments and the number of negative comments. Also, positive comments outnumber negative comments almost three to one.

One might be tempted at this point to run out and say,

“Oh, look! Sentiment is very positive!”

Also, it appears that people who have more negative comments also have more positive comments. This means that …

Just stop right there.

Before saying this means anything, you should go back and take a look at the comments being categorized as positive or negative. The first thing you will note is that computers are very poor at detecting sarcasm, subject changes and idioms. The data came from comments on blogs related to Apple computer products. Here are just a few of the cases where I disagreed with the computer.

“Yeah, tell us you’ll improve conditions at your manufacturing plant in China. That would be great, wouldn’t it?” (Includes “improve” and “great”, so counted as two positive sentiments)
“I’d rather not say nice try, but … ” (Counts as one positive comment, with the word “nice”)
“Buy Windows! It’s superior” (Counts as one positive comment, with the word “superior”)
“Too bad I can’t buy it right now.” (Counts as negative, with the word “bad”)
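I don't know exactly how Statistica scores sentiment internally, but a simple keyword counter is enough to reproduce every one of these miscounts. The sketch below is in Python, and the word lists are hypothetical ones I made up to match the examples above. They are not Statistica's dictionary.

```python
import re

# Hypothetical word lists, for illustration only (not Statistica's dictionary).
POSITIVE = {"improve", "great", "nice", "superior"}
NEGATIVE = {"bad"}

def count_sentiments(text):
    """Return (positive, negative) keyword counts for one comment."""
    words = re.findall(r"[a-z]+", text.lower())
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return positive, negative

comments = [
    "Yeah, tell us you'll improve conditions at your manufacturing plant "
    "in China. That would be great, wouldn't it?",
    "I'd rather not say nice try, but ...",
    "Buy Windows! It's superior",
    "Too bad I can't buy it right now.",
]

for comment in comments:
    print(count_sentiments(comment), comment)
```

The counter has no idea that the first comment is sarcastic, that the third praises a competitor, or that "too bad" in the fourth is an idiom expressing a wish to buy.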

I’m not saying that Statistica is bad – I don’t think it is – or that text mining is useless – I don’t think that, either.

What I DO think is that text mining has to be an iterative process. First, you get your results and then you examine them, make some changes – in this case I would start with the synonyms data – and you re-run your analysis.
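One pass of that loop might look something like the Python sketch below. It assumes a simple keyword counter, and the word lists and the `NEUTRAL_PHRASES` exception list are hypothetical names I invented for the example. The idea is only that reading the miscounted comments gives you a list of phrases to treat differently before you re-run.

```python
import re

# Hypothetical word lists, for illustration only.
POSITIVE = {"great", "nice", "superior"}
NEGATIVE = {"bad"}

# Hypothetical exceptions, added after reading the miscounted comments.
NEUTRAL_PHRASES = ["too bad", "nice try"]

def count_sentiments(text, neutral_phrases=()):
    """Return (positive, negative) keyword counts, skipping listed phrases."""
    text = text.lower()
    for phrase in neutral_phrases:
        text = text.replace(phrase, " ")
    words = re.findall(r"[a-z]+", text)
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return positive, negative

comment = "Too bad I can't buy it right now."
first_pass = count_sentiments(comment)                     # original word lists
second_pass = count_sentiments(comment, NEUTRAL_PHRASES)   # after revising

print("first pass: ", first_pass)
print("second pass:", second_pass)
```

Then you read the results again, because the revised list will have its own mistakes.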

Off to bed. I have to be up in six hours and head to the Black Belt Magazine studios for a photo shoot for our new book that is coming out this fall, Winning on the Ground: Championship matwork for judo, grappling and mixed martial arts.

It’s a bit of a leap from text mining, but variety IS the spice of life.

