
Text Mining with Statistica (or anything else) – look again!

By AnnMaria De Mars, May 16, 2012

There are some things to like about Statistica. The scatter plot matrix, for one. I’d done a sentiment analysis of a data set on blog posts (not mine). For each post, I had three variables:

number of negative sentiments expressed in the post,
number of positive sentiments expressed in the post,
total number of comments that poster had made, ranging from 1 to over 1,000.

I thought people who comment a lot would be the ones with the most negative comments, whereas there would not be as much of a correlation between positive comments and comment frequency.

I like the graphic output you get, which shows a frequency distribution for each variable and a plot for each pair. All at once you can get a sense of the strength of the correlation, whether it might be affected by restriction of range – as shown by a skewed distribution – or by outliers.
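To make the outlier point concrete, here is a minimal sketch in Python. The numbers are toy values I made up for illustration, not the blog data, and `pearson` is just a hand-rolled helper. It shows how one poster with 1,000 comments can turn a slightly negative correlation into one that looks nearly perfect.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up toy values: total comments per poster and negative sentiments per poster.
total_comments = [1, 2, 3, 4, 5, 6, 7, 8]
negative_counts = [2, 0, 1, 2, 0, 1, 2, 0]

r_without = pearson(total_comments, negative_counts)

# Add one hypothetical very heavy commenter and recompute.
r_with = pearson(total_comments + [1000], negative_counts + [40])

print(f"without outlier: {r_without:.2f}")
print(f"with outlier:    {r_with:.2f}")
```

A correlation coefficient alone would hide this; the scatter plot and the skewed frequency distribution are what give it away.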

There seems to be an actual correlation between the number of positive comments and the number of negative comments. Also, positive comments outnumber negative comments almost three to one.

One might be tempted at this point to run out and say,

“Oh, look! Sentiment is very positive!”

Also, it appears that people who have more negative comments also have more positive comments; this means that …

Just stop right there.

Before saying this means anything, you should go back and take a look at the comments being categorized as positive or negative. The first thing you will note is that computers are very poor at detecting sarcasm, subject changes and idioms. The data came from comments on blogs related to Apple computer products. Here are just a few of the cases where I disagreed with the computer.

“Yeah, tell us you’ll improve conditions at your manufacturing plant in China. That would be great, wouldn’t it?” (Includes “improve” and “great”, so counted as two positive sentiments.)
“I’d rather not say nice try, but … ” (Counts as one positive sentiment, for the word “nice”.)
“Buy Windows! It’s superior” (Counts as one positive sentiment, for the word “superior”.)
“Too bad I can’t buy it right now.” (Counts as negative, for the word “bad”.)
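To see why comments like these get miscounted, here is a minimal sketch of keyword-based sentiment counting in Python. This is not how Statistica works internally. The word lists and the `count_sentiments` function are made up for illustration, and the prefix match on "improv" is my assumption, standing in for whatever stemming a real tool would do.

```python
import re

# Hypothetical, tiny word lists for illustration only; a real tool ships far larger ones.
POSITIVE = {"great", "nice", "superior"}
NEGATIVE = {"bad"}
POSITIVE_STEMS = ("improv",)  # crude stand-in for stemming: improve, improving, ...


def count_sentiments(text):
    """Return (positive, negative) counts from plain keyword matching."""
    words = re.findall(r"[a-z']+", text.lower())
    positive = sum(1 for w in words if w in POSITIVE or w.startswith(POSITIVE_STEMS))
    negative = sum(1 for w in words if w in NEGATIVE)
    return positive, negative


comments = [
    "Yeah, tell us you'll improve conditions at your manufacturing plant in China. "
    "That would be great, wouldn't it?",
    "I'd rather not say nice try, but ...",
    "Buy Windows! It's superior",
    "Too bad I can't buy it right now.",
]

for comment in comments:
    print(count_sentiments(comment), comment)
```

Matching on single words has no notion of who is being praised, whether "not" came first, or whether "too bad" is an idiom, which is exactly where the counts above go wrong.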

I’m not saying that Statistica is bad – I don’t think it is – or that text mining is useless – I don’t think that, either.

What I DO think is that text mining has to be an iterative process. First, you get your results and then you examine them, make some changes – in this case I would start with the synonyms data – and you re-run your analysis.
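As a rough sketch of what one round of that loop could look like, here is a Python example. It assumes a hypothetical dictionary-style lexicon and a made-up `score` function rather than Statistica's actual synonyms feature, and the phrase-level corrections in the second pass are my own judgment calls.

```python
import re

# Hypothetical starting lexicon: word or phrase -> score (+1 positive, -1 negative).
lexicon = {"great": 1, "nice": 1, "superior": 1, "bad": -1}


def score(text, lexicon):
    """Sum lexicon scores, matching longer phrases first so they override single words."""
    remaining = text.lower()
    total = 0
    for phrase in sorted(lexicon, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        # Blank out each match so a shorter entry cannot count the same words again.
        remaining, hits = re.subn(pattern, " ", remaining)
        total += hits * lexicon[phrase]
    return total


comments = [
    "I'd rather not say nice try, but ...",
    "Too bad I can't buy it right now.",
]

# Pass 1: run with the starting lexicon and read the results by hand.
first_pass = [score(c, lexicon) for c in comments]

# Pass 2: after reading the comments, add phrase-level entries and re-run.
lexicon.update({"nice try": -1, "too bad": 0})
second_pass = [score(c, lexicon) for c in comments]

print(first_pass, second_pass)
```

Each pass will surface new disagreements, so the reading-by-hand step does not go away; it just gets shorter.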

Off to bed. I have to be up in six hours and head to the Black Belt Magazine studios for a photo shoot on our new book that is coming out this fall, Winning on the Ground: Championship matwork for judo, grappling and mixed martial arts.

It’s a bit of a leap from text mining, but variety IS the spice of life.
