In the past, when I had to do any type of parsing of text, I wrote my own code with a zillion SUBSTR functions and IF statements. It did the job, but it was so-o-o ugly and painful that I never even considered including text mining in any courses I taught.
I looked into SAS Enterprise Miner years ago but the commercial version costs (and this is approximate) $1,278,544,899,711,315 and your left kidney.
The SAS On-Demand version sucked. You know how some programs you can get a cup of coffee while waiting for them to run? With the original SAS On-Demand for Enterprise Miner you could fly to Colombia, work as a day laborer to earn the money to buy land, start your own plantation, breed a strain of genetically superior coffee beans and skip the country on the last plane out just before the latest government coup nationalizes your business, and your results STILL wouldn’t be available when you got back.
Having had such good luck with SAS On-Demand for Enterprise Guide last semester, I thought I’d give Enterprise Miner another look.
Last year, The Spoiled One was in the living room with her boring parents, complaining they were watching The Daily Show with boring news when it turned out that Justin Bieber was the guest.
She must have felt like this.
The latest version is unbelievably faster. I cannot tell you if it is better because it ran so slowly in the past that it was impossible to tell. It is easy to use. Let me give an example.
First, you register with SAS On-Demand and register a course for use with Enterprise Miner. This is really easy.
Second, you start Enterprise Miner, which requires nothing more than clicking on the Get Software link on your login page.
Next, create a project. Just go to FILE > NEW > PROJECT and click Next a lot. Along the way you give it a name. It’s pretty obvious.
It may not be obvious that you need to have a data source available and create a diagram. Again, it’s pretty easy to figure out, though.
Creating a data source – go to FILE > NEW > DATA SOURCE
A window pops up and the default is SAS TABLE, which is what you want if your data is in a SAS dataset (they now call them tables. I blame the damn SQL people.). Click Next.
In the next window, you browse to where your data are. Because I am just testing this for use in a class, I used the abstract data set in the Sampsio library.
So, you have a project, a blank diagram and a data source. Now what?
1. Drag the icon under data sources onto your diagram.
2. Click on the Text Mining tab.
3. Click on the Text Parsing tab (hovering over each tab with the mouse will give you its name) and drag it to the diagram.
4. Click on the little grey stem sticking out of the end of your data source and drag it to the Text Parsing box.
5. Now, right-click on the Text Parsing box and from the drop-down menu, select RUN.
After a bit, it will come up with a window that has two choices, OK and Results. Click on Results. The most interesting bit in the results, I think, is the table of frequency for each word. You can see which words are most common in your documents.
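If you want a feel for what that frequency table represents, here is a minimal sketch in Python. This is not what Enterprise Miner does internally (it does a great deal more), and the three "documents" are made-up stand-ins, not the Sampsio abstracts. It just shows the basic idea of splitting text into words and counting them.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the abstracts; NOT the actual Sampsio data
documents = [
    "SAS makes text mining of the data easy.",
    "Text parsing breaks the data into terms.",
    "Mining the text is the fun part.",
]

def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def term_frequencies(docs):
    """Count how many times each word occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts

freq = term_frequencies(documents)

# Show the five most common words and their counts
for word, n in freq.most_common(5):
    print(word, n)
```

Notice that the most common word in this toy example is "the", which tells you nothing about the documents. That is exactly the problem a stop list is meant to solve.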
STOP WORDS AND OTHER OPTIONS
This is just the beginning, of course. As you can imagine, if you had to actually write a program to read every word separately, that would take a bit of time. It would take far more time to have it ignore words that are useless, like “the”, “that”, “there”. These are called stop words. Enterprise Miner has a stop list and you can add or delete words from it.
Click on the thing that looks like a page to add a row and type in another stop word. For example, these abstracts come from the SAS Global Forum proceedings, so they probably all have some words like data and SAS that occur in every one of them, which makes those words pretty useless as far as analyzing the documents. You can add those to your stop list.
If there is a word you want to keep, you can remove it from the stop list by selecting it and clicking that X at the top (right next to the thing that looks like a new page). You’ll be asked if you are sure you want to delete that row.
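For the curious, here is a rough sketch in Python of what adding to and deleting from a stop list amounts to. The stop list, the two sample abstracts and the function name count_terms are all made up for illustration; Enterprise Miner's own stop list is far longer, and this is not how it is implemented.

```python
import re
from collections import Counter

# A tiny, hypothetical stop list (the real one has many more words)
stop_list = {"the", "that", "there", "of", "is", "a"}

def count_terms(docs, stop_words):
    """Count words across documents, skipping anything on the stop list."""
    counts = Counter()
    for doc in docs:
        for word in re.findall(r"[a-z']+", doc.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts

# Made-up sample text, not real abstracts
abstracts = [
    "The data that SAS reads is there.",
    "SAS data is the start of a model.",
]

# Add words that occur in nearly every document (like adding a row)
stop_list.update({"sas", "data"})

# Remove a word you want to keep (like clicking the X)
stop_list.discard("there")

counts = count_terms(abstracts, stop_list)
print(sorted(counts.items()))
```

After those two changes, "sas" and "data" drop out of the counts and "there" comes back in.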
How do you get to the stop list, you may ask, quivering with excitement. Click on the Text Parsing box in your diagram and the properties you can set for it include:
the language to use,
a list of multi-word terms, everything from “a lot” to “keep in mind” to “zero in”,
parts of speech to ignore, like adjectives, and, of course,
the stop list.
To modify any of these, just click on the three dots next to it and a window will pop up, like the one shown above for the stop list.
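The multi-word term list is the least obvious of these, so here is a minimal sketch in Python of the general idea: treat each listed phrase as a single term before counting words. The function name and the underscore-joining trick are my own assumptions for illustration, not how Enterprise Miner handles it.

```python
import re

# A few multi-word terms, taken from the examples mentioned above
multi_word_terms = ["a lot", "keep in mind", "zero in"]

def tokenize_with_phrases(text, phrases):
    """Split text into terms, keeping each listed phrase as one term."""
    text = text.lower()
    # Handle longer phrases first so a short phrase can't break up a longer one
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        # Join the words of the phrase with underscores so they stay together
        text = re.sub(pattern, phrase.replace(" ", "_"), text)
    return re.findall(r"[a-z_']+", text)

tokens = tokenize_with_phrases(
    "Keep in mind that we zero in on a lot of terms.", multi_word_terms
)
print(tokens)
```

Without the phrase list, "keep in mind" would be counted as three unrelated words, two of which would probably end up on your stop list anyway.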
If you haven’t actually had to analyze text data before, you have NO IDEA how amazingly awesomely cool this all is.
When I was in graduate school, we would actually print out multiple copies of the documents, cut the pages into paragraphs and sort them into categories.
More recently, this is why I started using Ruby: it was much easier to parse text with it than with SAS. There were some cheaper and open source solutions that I looked at, but their documentation was non-existent and their interfaces were clear as mud.
The Bad and Good News
Speaking of unclear interfaces … I’m not sure I would have guessed that the page with the corner folded meant “add new row”. Also, there is a LOT of stuff on the Enterprise Miner screen. You have all of these different panes in the window and the options in them are completely different depending on whether you have clicked on the text mining tab, the text parsing box or something else. I’ve read a couple of data mining books, one specifically on Enterprise Miner, and they still were very sparse, particularly in their treatment of text mining, which is what I was most interested in.
That’s the bad news. The good news is that when I was at SAS Global Forum, I picked up a copy of Practical Text Mining. I almost didn’t buy it because it’s over 1,000 pages and my suitcase was already pretty full, which meant I’d have to lug it through the airport. Even worse, it did not have an electronic version, which is tough for me because even with contacts and glasses worn OVER my contacts, I still have difficulty reading some of the screen shots in it. (I expect if I had normal eyesight, I’d be fine.)
All that being said, this book is really useful. I know I got a discount at the conference, but still, it was about $70, which for a textbook like this is super-cheap. A thousand pages sounds like a lot, but that’s because it starts with the very basics and is a bit redundant. That’s not so terrible, though, because that makes it easy to read. I was lying in bed sick this morning and read the first 120 pages in about two hours.
This is a godsend to anyone doing a qualitative dissertation. The real tragedy is that a lot of people in areas that do qualitative research – education, psychology, nursing, social work, to name a few – probably won’t even be aware that Enterprise Miner exists, much less that they can get it for free to use in teaching their courses.
Seriously, people, this is a huge opportunity for you to teach your students about text mining and it’s really not that hard.