# Text Mining Notes from Awesome Free Course from SAS (Yeah, you read that right)

June 25, 2014

Hot tip: If you are a professor, you have access to some major benefits from SAS. The main ones that jump to mind are:

- Free classes that are worth FAR more than you paid for them.
- Free software via SAS On-Demand.
- Free books – up to two per semester.
- Free teaching materials

You can get more information on the SAS Global Academic Program here.

Crazy, but true. I went to San Diego for two days (yes, I had to pay my own travel expenses, but with a Prius that’s $10 in gas and a night in a hotel room) and went to a free course on SAS Enterprise Miner. I have SAS Enterprise Miner free for a class I am teaching in the fall, and unlike desk copies, it’s not just free for the professor but for all of the students. The class is data mining, and although I really doubt we will get into text mining much, I think I may cover just an introduction in the last lecture. So, to remind myself, and for anyone else who might be teaching the same course, here are some of my notes.

**General**

The term-document matrix is a key concept in understanding SAS Text Miner (and probably any other text mining software). The columns are the documents and the rows are the terms, like algebra, quotient, statistics.

Of course, you are going to have plenty of 0 cells, where the document does not include the word, say, “statistics”, and plenty of rows with many non-zero cells, where the term shows up in many, many documents, like, say, the word “mathematics”.
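To make the term-document matrix concrete, here is a minimal sketch in plain Python. It is not how SAS Text Miner builds the matrix (which does far more, like stemming and stop lists); the three tiny documents and the function name `term_document_matrix` are made up for illustration.

```python
from collections import Counter

# Hypothetical mini-corpus, invented for this example.
documents = {
    "doc1": "algebra and statistics are branches of mathematics",
    "doc2": "the quotient is the result of division in mathematics",
    "doc3": "statistics uses mathematics to describe data",
}

def term_document_matrix(docs):
    """Return (terms, doc_ids, matrix): one row per term, one column per document."""
    doc_ids = list(docs)
    # Count how often each whitespace-separated word occurs in each document.
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    terms = sorted(set().union(*counts.values()))
    # A Counter returns 0 for a missing word, which gives the 0 cells.
    matrix = [[counts[doc_id][term] for doc_id in doc_ids] for term in terms]
    return terms, doc_ids, matrix

terms, doc_ids, matrix = term_document_matrix(documents)
for term, row in zip(terms, matrix):
    print(f"{term:12s} {row}")
```

In this toy corpus the row for “mathematics” has no zeros, while the row for “quotient” is zero everywhere except the one document that uses it.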

According to the instructor, text mining is a subset of text analytics. I always used them synonymously and we didn’t get into the distinction. Feel free to comment if you have an opinion, like that I should be burned at the stake for such text mining/analytics incest.

Using the filter in text mining works identically to a WHERE statement in an analysis in SAS. That is, it does not delete any observations from your data set, but going forward in the analysis it only uses the records that match the filter (the WHERE condition).
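The same idea in plain Python, as a rough sketch: filtering produces a subset for the next step and leaves the original data alone. The records, the field names and the `apply_filter` helper are all hypothetical, and this is not SAS or Text Miner code.

```python
# Hypothetical records; the field names are invented for illustration.
documents = [
    {"id": 1, "language": "en", "text": "algebra homework help"},
    {"id": 2, "language": "es", "text": "ayuda con la tarea"},
    {"id": 3, "language": "en", "text": "statistics lecture notes"},
]

def apply_filter(records, predicate):
    """Return the matching records as a new list; the input list is not modified."""
    return [record for record in records if predicate(record)]

# Downstream steps would use english_only; documents still holds every record.
english_only = apply_filter(documents, lambda record: record["language"] == "en")
print(len(documents), len(english_only))
```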

Two general goals of data mining:

- Pattern discovery – you don’t have a response variable. You are trying to find variables that cluster together.
- Prediction – you do have a response (target) variable and you are trying to predict it.

Kind of makes me think of statistics in general, where you have things like cluster analysis, factor analysis on one end and techniques like regression on the other.

People can manipulate a few inputs, but not everything, which is one way text mining can be used to identify fraud, by using large numbers of variables and looking for suspicious clusters. The whole fraud detection discussion of the course was pretty interesting, even though I’m not involved in credit card or insurance industries or other areas where it is such a big deal. I just found it intellectually interesting.

If you like matrix algebra (which I do), there was an interesting discussion of Singular Value Decomposition (SVD) and the term-document matrix. It seemed very much like principal components analysis (PCA), multiplying a vector of weights by a set of responses. An article was mentioned that distinguishes between SVD and PCA but, to be truthful, I probably won’t find the time to read it. I did end up discussing it with The Invisible Developer, though, who got a math degree at UCLA “because I thought as long as I was getting a degree in physics, I might as well”. We are well matched. This is the kind of career planning we go in for at The Julia Group.
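Back to SVD and PCA for a moment. This is the standard textbook relationship, not something taken from the course or from the article mentioned above, and the notation is my own assumption: $A$ is the term-document matrix and $X$ is an $n \times p$ data matrix whose columns have been centered (mean subtracted).

$$
A = U \Sigma V^{\top}
$$

$$
X = U \Sigma V^{\top}
\quad\Longrightarrow\quad
\frac{1}{n-1} X^{\top} X = V \, \frac{\Sigma^{2}}{n-1} \, V^{\top},
\qquad
X V = U \Sigma
$$

So PCA is an SVD of the centered data: the columns of $V$ are the principal axes, $U\Sigma$ holds the component scores, and the squared singular values divided by $n-1$ are the component variances. In text mining the SVD is usually applied to the term-document matrix without centering, which keeps all of those 0 cells as zeros.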

**Topics vs terms**

Terms help define a topic.

Topic and category are not the same.

A document can be in only one category (cluster).

A topic can appear in multiple documents and a document can contain multiple topics.

Topic = concept. The two are used interchangeably (at least as far as the Text Miner documentation is concerned).

**Types of data sets**

Training, test and validation data sets are all based on historical data. You actually know what the value of the target variable is.

With a scoring data set, you don’t know the value of the target variable. That is what you are trying to predict.

**General**

Options for transforming text to numbers:

- Boolean count – the term shows up or not
- Frequency counting
- Information theoretic counting (log of frequency counts)
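The three counting options above can be sketched as tiny functions of a raw count (how many times a term appears in one document). The function names are mine, and using log base 2 of (1 + count) for the log option is an assumption: it is one common convention that keeps a count of zero at zero, not necessarily the exact formula from the course.

```python
import math

def binary_weight(count):
    """Boolean: 1 if the term shows up in the document at all, otherwise 0."""
    return 1 if count > 0 else 0

def frequency_weight(count):
    """Frequency: the raw count, unchanged."""
    return count

def log_weight(count):
    """Log: dampens large counts. Assumed form: log2(1 + count)."""
    return math.log2(1 + count)

for count in [0, 1, 3, 7]:
    print(count, binary_weight(count), frequency_weight(count), log_weight(count))
```

The point of the log version is that a term appearing 7 times is not treated as 7 times as important as a term appearing once.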

Adjusting for document size and corpus size (number of documents) gives you term weights:

- Entropy weights (Shannon information theory)
- Inverse document frequency weights
- Target-based weights
- Others
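Here is a minimal sketch of the first two term weights in the list above, using textbook formulations: inverse document frequency as log2(N / df) and the log-entropy weight as 1 + Σ p·log2(p) / log2(N). These are assumptions on my part; the exact formulas in SAS Text Miner may differ (for example, by adding a constant), and the function names are hypothetical. Target-based weights need a response variable and are not shown.

```python
import math

def idf_weight(doc_freq, n_docs):
    """Inverse document frequency, log2(N / df): rarer terms get larger weights."""
    if not 0 < doc_freq <= n_docs:
        raise ValueError("doc_freq must be between 1 and n_docs")
    return math.log2(n_docs / doc_freq)

def entropy_weight(counts_per_doc):
    """Log-entropy weight for one term, given its count in every document.

    Close to 1 when the term is concentrated in a few documents,
    close to 0 when it is spread evenly over all of them.
    """
    n_docs = len(counts_per_doc)
    total = sum(counts_per_doc)
    if n_docs < 2 or total == 0:
        raise ValueError("need at least two documents and one occurrence")
    p_log_p = 0.0
    for count in counts_per_doc:
        if count > 0:
            p = count / total  # share of the term's occurrences in this document
            p_log_p += p * math.log2(p)
    return 1 + p_log_p / math.log2(n_docs)

print(idf_weight(1, 8), idf_weight(8, 8))
print(entropy_weight([4, 0, 0, 0]), entropy_weight([1, 1, 1, 1]))
```

Both weights reward the same thing from different angles: a term that appears everywhere, evenly, tells you very little about any one document.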

You can combine traditional data mining inputs with text mining inputs in a predictive model.

I’ll post some more on specifics of how to use SAS Text Miner in my next post, but I wanted to point out two advantages for professors of taking a course, any course:

- It’s good to take courses to remind yourself what it’s like to not be the expert. So often, we get used to knowing all of the little nuances of a field and forget what it’s like to not find it obvious that the F value is the ratio of two estimates of variance, one obtained from between group differences and one from within groups. Back when I had slightly more time, I tried taking one course a year in something I knew nothing about, like microbiology. I learned interesting stuff and maintained more empathy.
- If you are lucky, you get to see good teaching modeled, and you can steal the instructor’s ideas. For example, in this class, it started out pretty slowly, but that was good because people who were not as familiar with data mining could get some understanding. It also was good that he defined a lot of the terms and basic concepts because I am just lifting some of that straight out of my notes for one of my lectures. (SAS not only allows this but they will encourage it and send you, free, instructional resources. If you are a professor, you only need to ask.) It was also good because by the afternoon of the first day, everyone was chomping at the bit to get their hands on the software and start running things, which would not have been the case if we’d started out using it right away. The less experienced people would have been lost and the more experienced people would have been bored after three hours of using it in the morning. I’m definitely stealing that idea for my class in the fall.

Here’s the other benefit I have found of courses, for professionals in general. Yes, you could maybe get all of the materials and read them in your spare time without going to San Diego or Cary or wherever. The fact is that I would NEVER sit down and spend 16 hours in a week studying anything. I would get interrupted, have meetings, answer email, return calls.

Of course, if you are going to get a real benefit, you need to use it when you get back, which I have pretty much failed at. I will explain why next week (how is that for an air of mystery). In the meantime, the best I can do is review my notes so I’m ready to jump in next week.

Oh, and for those people who say that SAS only gives you free things because they want organizations to pay to use their software that students will be trained on – I’m sure that’s true. So?
