More notes from the text mining class. …
This is the article I mentioned in the last post, on Singular Value Decomposition.
Contrary to expectations, I did find time to read it on the ride back from Las Vegas. It is surprisingly accessible even to people who don’t have a graduate degree in statistics, so I am going to include it in the optional reading for my course.
Many of these concepts, like start and stop lists, apply to any text mining software. It just happens that the class I’m teaching this fall uses SAS.
In Enterprise Miner, you can only have one project open at a time, but you can have multiple diagrams and libraries, and of course, zillions of nodes, in a single project.
In Enterprise Miner, you can use text or text location as a variable type. Documents under 32K in size can be stored in the project as a text variable. For documents larger than 32K, give a text location instead.
- start lists – often used for technical terms
- stop lists, e.g. articles like “the” and pronouns. These appear so frequently in documents that they don’t contribute to our goal, which is to distinguish between documents. A stop list may also include words that are high frequency in your particular data. For example, “mathematics” belongs on the stop list for our data because it is in almost every document we are analyzing.
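To make the two lists concrete, here is a rough sketch in Python rather than Enterprise Miner. It is not how SAS implements anything. The word lists, the sample sentences and the function names are all made up for illustration, and it assumes a start list means "keep only the terms on the list".

```python
import re

# Hypothetical stop list: articles, pronouns, and one word that is
# high frequency in this particular (made-up) data.
STOP_LIST = {"the", "a", "an", "it", "they", "we", "mathematics"}

# Hypothetical start list of technical terms we care about.
START_LIST = {"variance", "regression"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def apply_stop_list(tokens, stop_list):
    """Drop every token that appears on the stop list."""
    return [t for t in tokens if t not in stop_list]


def apply_start_list(tokens, start_list):
    """Keep only tokens that appear on the start list (assumed behavior)."""
    return [t for t in tokens if t in start_list]


tokens = tokenize("The students like mathematics and the teacher")
print(apply_stop_list(tokens, STOP_LIST))
# ['students', 'like', 'and', 'teacher']

tokens = tokenize("Regression and variance are the topics")
print(apply_start_list(tokens, START_LIST))
# ['regression', 'variance']
```

Notice that "and" survives the stop list in the first example only because it was left off the made-up list. In practice, the list you use decides what is left to analyze.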
Multi-word term tables – “standard deviation” is a multi-word term, treated as a single term rather than as two separate words
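As a rough illustration of the idea (in Python, not Enterprise Miner), the sketch below merges a known two-word term into a single token before any counting happens. The table contents and the function name are hypothetical, and this version only handles two-word terms.

```python
# Hypothetical multi-word term table, stored as pairs of words.
MULTI_WORD_TERMS = {("standard", "deviation")}


def merge_multiword(tokens, terms):
    """Join adjacent token pairs found in `terms` into one token."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in terms:
            # Treat the pair as a single term and skip both words.
            merged.append(" ".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_multiword(["the", "standard", "deviation", "is", "large"],
                      MULTI_WORD_TERMS))
# ['the', 'standard deviation', 'is', 'large']
```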
Importing a dictionary: go to properties. Click the … button next to the dictionary (start or stop) you want to import. When a window comes up, click IMPORT.
Select the SAS library you want. Then select the data set you want. If you don’t find the library that you want, try this:
- Close your project.
- Open it again.
- Click on the 3 dots next to PROJECT START CODE in the property window.
- Write a LIBNAME statement that gives the directory where your dictionaries are located.
- Open your project again.
[Note: Re-read that last part on start code. This applies to any time you aren’t finding the library you are looking for, not just for dictionaries. You can also use start code for any SAS code you want to run at the start of a project. I can see people like myself, who are more familiar with SAS code than Enterprise Miner, using that a lot.]
Filter viewer – can specify minimum number of documents for term inclusion
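The minimum-documents setting is easy to picture with a small Python sketch. This is not Enterprise Miner's code, just the general idea under my own assumptions: count how many documents each term appears in, then keep only terms that meet the threshold. The function names and the sample documents are invented for illustration.

```python
from collections import Counter


def document_frequency(tokenized_docs):
    """Count the number of documents each term appears in."""
    df = Counter()
    for tokens in tokenized_docs:
        # set() so a term counts once per document, however often it repeats.
        df.update(set(tokens))
    return df


def filter_by_min_docs(tokenized_docs, min_docs):
    """Keep only terms that appear in at least `min_docs` documents."""
    df = document_frequency(tokenized_docs)
    keep = {term for term, n in df.items() if n >= min_docs}
    return [[t for t in tokens if t in keep] for tokens in tokenized_docs]


docs = [
    ["fraction", "game"],
    ["fraction", "score"],
    ["game", "fraction"],
]
print(filter_by_min_docs(docs, min_docs=2))
# [['fraction', 'game'], ['fraction'], ['game', 'fraction']]
```

Here "score" is dropped because it shows up in only one of the three documents.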
Speaking of Las Vegas, blogging has been a little slow lately since we took off to watch The Perfect Jennifer get married. It was a very small wedding, officiated by Hawaiian Elvis. Darling Daughter Number Three doubled as bartender and bridesmaid then stayed in Las Vegas because she has a world title fight in a few days.
Given the time crunch, I was particularly glad I’d attended this course that gave me the opportunity to draft at least one week’s worth of lectures in the fall. When I finish these notes, my plan is to edit them and turn them into the last lecture in the data mining course. If it’s helpful to you, feel free to use whatever you like. I’ll try to remember to post a more final version in the fall. If you have teaching resources for data mining yourself, please let me know.
My crazy schedule is the reason I start everything FAR ahead of time.