More notes from the text mining class. …
This is the article I mentioned in the last post, on Singular Value Decomposition.
Contrary to expectations, I did find time to read it, on the ride back from Las Vegas and it is surprisingly accessible even to people who don’t have a graduate degree in statistics, so I am going to include it in the optional reading for my course.
Many of these concepts, like start and stop lists, apply to any text mining software, but it just happens that the class I’m teaching this fall uses SAS.
In Enterprise Miner, you can only have 1 project open at a time, but you can have multiple diagrams and libraries, and of course, zillions of nodes, in a single project.
In Enterprise Miner, you can use text or text location as a type of variable. Documents smaller than 32K can be contained in the project as a text variable. If a document is greater than 32K, give a text location.
- start lists – often used for technical terms
- stop lists, e.g. articles like “the”, pronouns. These appear with such frequency in documents that they don’t contribute to our goal, which is to distinguish between documents. May also include words that are high frequency in your particular data. For example, “mathematics” in our data, because it is in almost every document we are analyzing.
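If it helps to see the start and stop list idea outside of Enterprise Miner, here is a minimal sketch in Python. Everything in it is made up for illustration: the helper name `filter_terms`, the tiny word lists and the sample sentence. It is not how SAS Text Miner implements its dictionaries, just the same idea in a few lines.

```python
def filter_terms(tokens, stop_list=None, start_list=None):
    """Keep tokens allowed by a start list and not blocked by a stop list.

    Hypothetical helper for illustration only, not SAS Text Miner's logic.
    """
    stop = {w.lower() for w in (stop_list or [])}
    start = {w.lower() for w in (start_list or [])}
    kept = []
    for token in tokens:
        word = token.lower()
        if word in stop:
            continue  # stop list: drop words that don't help tell documents apart
        if start and word not in start:
            continue  # start list: when one is given, keep only the listed terms
        kept.append(word)
    return kept


tokens = "The quotient in mathematics is the result of division".split()
# "mathematics" is on the stop list because, in this made-up data,
# it would be in almost every document
stop_words = ["the", "in", "is", "of", "mathematics"]
kept_terms = filter_terms(tokens, stop_list=stop_words)
print(kept_terms)
```

Passing a `start_list` instead keeps only the technical terms you name, which is the opposite way of getting at the same goal.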
Multi-word term tables – standard deviation is a multi-word term
Importing a dictionary — go to properties. Click the …. next to the dictionary (start or stop) you want to import. When it comes up with a window, click IMPORT
Select the SAS library you want. Then select the data set you want. If you don’t find the library that you want, try this:
- Close your project.
- Open it again
- Click on the 3 dots next to PROJECT START CODE in the property window
- Write a LIBNAME statement that gives the directory where your dictionaries are located.
- Open your project again
[Note: Re-read that last part on start code. This applies to any time you aren’t finding the library you are looking for, not just for dictionaries. You can also use start code for any SAS code you want to run at the start of a project. I can see people like myself, who are more familiar with SAS code than Enterprise Miner, using that a lot.]
Filter viewer – can specify minimum number of documents for term inclusion
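The minimum-number-of-documents idea is simple enough to sketch in a few lines of Python. This is only an illustration under my own assumptions (whitespace tokenizing, a made-up helper called `terms_meeting_min_docs`, three toy documents), not a description of what the filter node does internally.

```python
from collections import Counter


def terms_meeting_min_docs(documents, min_docs):
    """Return the terms that appear in at least `min_docs` documents, sorted."""
    doc_counts = Counter()
    for text in documents:
        # set() so a term counts once per document, however often it repeats
        doc_counts.update(set(text.lower().split()))
    return sorted(term for term, n in doc_counts.items() if n >= min_docs)


# Toy documents, invented for the example
docs = [
    "algebra quotient algebra",
    "statistics algebra",
    "statistics regression",
]
frequent_terms = terms_meeting_min_docs(docs, min_docs=2)
print(frequent_terms)
```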
Speaking of Las Vegas, blogging has been a little slow lately since we took off to watch The Perfect Jennifer get married. It was a very small wedding, officiated by Hawaiian Elvis. Darling Daughter Number Three doubled as bartender and bridesmaid then stayed in Las Vegas because she has a world title fight in a few days.
Given the time crunch, I was particularly glad I’d attended this course that gave me the opportunity to draft at least one week’s worth of lectures in the fall. When I finish these notes, my plan is to edit them and turn them into the last lecture in the data mining course. If it’s helpful to you, feel free to use whatever you like. I’ll try to remember to post a more final version in the fall. If you have teaching resources for data mining yourself, please let me know.
My crazy schedule is the reason I start everything FAR ahead of time.
Hot tip: If you are a professor, you have access to some major benefits from SAS. The main ones that jump to mind are:
- Free classes that are worth FAR more than you paid for them.
- Free software via SAS On-Demand.
- Free books – up to two per semester.
- Free teaching materials
Crazy, but true. I went to San Diego for two days (yes, I had to pay my own travel expenses, but with a Prius that’s $10 in gas and a night at a hotel room) and went to a free course on SAS Enterprise Miner. I have SAS Enterprise Miner free for a class I am teaching in the fall, and unlike desk copies, it’s not just free for the professor but for all of the students. I’m teaching data mining in the fall and although I really doubt we will get into text mining much, I think I may cover just an introduction in the last lecture. So, to remind myself, and for anyone else who might be teaching the same course, here are some of my notes.
Term-document matrix is a key concept in understanding SAS Text Miner (and probably any other text mining software): columns are the documents, rows are the terms, like algebra, quotient, statistics.
Of course, you are going to have plenty of 0 cells, where the document does not include the word, say “statistics”, and plenty of rows that show up in many, many documents, like, say, the word “mathematics”.
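Here is a small Python sketch of a term-document matrix, with terms as rows and documents as columns as described above. The three documents and the function name `term_document_matrix` are invented for the example, and the tokenizing is just splitting on spaces, which is far cruder than what real text mining software does.

```python
def term_document_matrix(documents):
    """Return (terms, matrix) with one row per term and one column per document."""
    tokenized = [doc.lower().split() for doc in documents]
    terms = sorted({word for words in tokenized for word in words})
    # Each cell is how many times the row's term occurs in the column's document
    matrix = [[words.count(term) for words in tokenized] for term in terms]
    return terms, matrix


# Made-up documents: "mathematics" is in every one, "statistics" in only one
docs = [
    "mathematics algebra quotient",
    "mathematics statistics statistics",
    "mathematics algebra",
]
terms, matrix = term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(f"{term:12s} {row}")
```

The “mathematics” row has no zeros at all, which is exactly why a word like that tells you nothing about which document you are looking at.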
According to the instructor text mining is a subset of text analytics. I always used them synonymously and we didn’t get into the distinction. Feel free to comment if you have an opinion, like that I should be burned at the stake for such text mining/analytics incest.
Using the filter in text mining works identically to a WHERE statement in an analysis in SAS, that is, it does not delete any observations from your data set, but going forward in the analysis it only uses the records that match the filter (WHERE statement).
Two general goals of data mining
- Pattern discovery – don’t have response variable. Trying to find variables that cluster together.
- Prediction – have a response (target) variable. Trying to predict its value from the inputs.
Kind of makes me think of statistics in general, where you have things like cluster analysis, factor analysis on one end and techniques like regression on the other.
People can manipulate a few inputs, but not everything, which is one way text mining can be used to identify fraud, by using large numbers of variables and looking for suspicious clusters. The whole fraud detection discussion of the course was pretty interesting, even though I’m not involved in credit card or insurance industries or other areas where it is such a big deal. I just found it intellectually interesting.
If you like matrix algebra (which I do), there was an interesting discussion of Singular Value Decomposition and the term document matrix. It seemed very much like principal components analysis, multiplying a vector of weights by a set of responses and an article was mentioned that distinguishes between SVD and PCA but to be truthful, I probably won’t find the time. I did end up discussing it with The Invisible Developer, though, who got a math degree at UCLA “because I thought as long as I was getting a degree in physics, I might as well”. We are well matched. This is the kind of career planning we go in for at The Julia Group.
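For anyone who wants to poke at the SVD idea outside of Enterprise Miner, here is a minimal sketch in Python. It only approximates the largest singular value (and its two vectors) of a small, made-up term-document matrix, and it uses power iteration, which is a stand-in technique I picked because it fits in a few lines. I am not claiming this is what SAS Text Miner does internally. The function name `leading_singular_triplet` and the matrix are assumptions for the example.

```python
import math


def leading_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value and its vectors by power iteration.

    Assumes a non-zero matrix given as a list of rows (terms x documents).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # starting guess with unit length
    for _ in range(iterations):
        # av = A v (one number per term)
        av = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        # w = A^T (A v) (one number per document)
        w = [sum(matrix[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in av))  # singular value = length of A v
    u = [x / sigma for x in av]                # weights on the terms
    return sigma, u, v


# Made-up term-document matrix: 4 terms (rows) by 3 documents (columns)
tdm = [
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
    [0, 2, 0],
]
sigma, u, v = leading_singular_triplet(tdm)
print("largest singular value:", round(sigma, 4))
print("term weights:", [round(x, 3) for x in u])
print("document weights:", [round(x, 3) for x in v])
```

The vector `v` is the leading eigenvector of the transposed matrix times the matrix itself, which is where the family resemblance to principal components comes from: PCA does the same kind of decomposition, but on the centered data.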
Topics vs terms
Terms help define a topic.
Topic and category are not the same.
A document can be in only one category (cluster)
A topic can appear in multiple documents & a document can contain multiple topics
topic=concept , used interchangeably (at least as far as text miner documentation is concerned)
Types of data sets
Training, test and validation data sets are all based on historical data. You actually know what the value of the target variable is.
A scoring data set, you are trying to predict.
Transforming text to number options
- Boolean count – shows up or not
- Frequency counting
- Information theoretic counting (log of frequency counts)
Adjust for document size & corpus (number of documents) size -> term weights
- Entropy weights (Shannon information theory)
- Inverse document frequency weights
- Target-based weights
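To make the counting and weighting options a bit more concrete, here is a Python sketch of three of the counts and two of the weights. The formulas are one common formulation (log base 2, inverse document frequency with 1 added, entropy scaled by the log of the number of documents). Treat the exact formulas as my assumption; the ones your software uses may differ in the details. The function names are made up, and each weight takes one row of the term-document matrix, that is, one term’s counts across all documents.

```python
import math


def boolean_count(count):
    """1 if the term shows up in the document, 0 if not."""
    return 1 if count > 0 else 0


def log_frequency(count):
    """Dampened frequency: log2(count + 1), so a count of 0 stays 0."""
    return math.log2(count + 1)


def idf_weight(row):
    """Inverse document frequency: rarer terms get bigger weights.

    Assumes the term appears in at least one document.
    """
    docs_with_term = sum(1 for c in row if c > 0)
    return math.log2(len(row) / docs_with_term) + 1


def entropy_weight(row):
    """Entropy weight: 0 for a term spread evenly over every document,
    1 for a term concentrated in a single document.

    Assumes at least two documents and a term that appears somewhere.
    """
    total = sum(row)
    entropy = 0.0
    for count in row:
        if count > 0:
            p = count / total
            entropy += p * math.log2(p)
    return 1 + entropy / math.log2(len(row))


# One row per term, one column per document (made-up counts)
mathematics = [1, 1, 1]  # in every document
statistics = [0, 2, 0]   # in only one document
for name, row in [("mathematics", mathematics), ("statistics", statistics)]:
    print(name, round(idf_weight(row), 3), round(entropy_weight(row), 3))
```

Either weight pushes “mathematics” down and “statistics” up, which is the adjustment for corpus size doing its job.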
Can combine traditional data mining inputs with text mining inputs in a predictive model
…. I’ll post some more on specifics of how to use SAS text miner in my next post, but I wanted to point out two advantages for professors of taking a course, any course:
- It’s good to take courses to remind yourself what it’s like to not be the expert. So often, we get used to knowing all of the little nuances of a field and forget what it’s like to not find it obvious that the F value is the ratio of two estimates of variance, one obtained from between group differences and one from within groups. Back when I had slightly more time, I tried taking one course a year in something I knew nothing about, like microbiology. I learned interesting stuff and maintained more empathy.
- If you are lucky, you get to see good teaching modeled, and you can steal the instructor’s ideas. For example, in this class, it started out pretty slowly, but that was good because people who were not as familiar with data mining could get some understanding. It also was good that he defined a lot of the terms and basic concepts because I am just lifting some of that straight out of my notes for one of my lectures. (SAS not only allows this but they will encourage it and send you, free, instructional resources. If you are a professor, you only need to ask.) It was also good because by the afternoon of the first day, everyone was chomping at the bit to get their hands on the software and start running things, which would not have been the case if we’d started out using it right away. The less experienced people would have been lost and the more experienced people would have been bored after three hours of using it in the morning. I’m definitely stealing that idea for my class in the fall.
Here’s the other benefit I have found of courses, for professionals in general. Yes, you could maybe get all of the materials and read it in your spare time without going to San Diego or Cary or wherever. The fact is, that I would NEVER sit down and spend 16 hours in a week studying anything. I would get interrupted, have meetings, answer email, return calls.
Of course, if you are going to get a real benefit, you need to use it when you get back, which I have pretty much failed at. I will explain why next week (how is that for an air of mystery). In the meantime, the best I can do is review my notes so I’m ready to jump in next week.
Oh, and for those people who say that SAS only gives you free things because they want organizations to pay to use their software that students will be trained on – I’m sure that’s true. So?
Maybe this is obvious, but I have often found that what is obvious to some people is not so obvious to others, so here are a few random tips.
1. Enterprise Miner can take a REALLY long time to load during which you wonder if anything is happening at all.
Open up the task manager and look for something that says javaw.exe *32. You can see it near the bottom in the image above. The number next to it should be going up, from 30,000 to 50,000, etc. If it is, you should probably be patient for a few more minutes and your session will start.
2. Let’s say you want to change the properties of something. For example, I don’t want the data set to be partitioned into Training, Validation and Test in a 40, 30, 30 split. I want it to be 50, 50, 0. So, I right-click on the DATA PARTITION node, get a drop-down menu and there is all of this stuff about Edit Variables all the way down to Disconnect Nodes. Where the hell are the properties to change? They’re on the left, in that window with the title Property! Funny, but it’s so easy to focus on the diagram window and completely forget about everything else. Click on a node and its properties will show up in the window.
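If you want to see what a 50, 50, 0 split amounts to, here is a rough Python sketch of a simple random partition. It is only an illustration: the function name `partition`, the seed and the 100 fake record numbers are made up, and the Data Partition node may well use a different method (stratifying on the target, for instance).

```python
import random


def partition(records, train=0.5, validate=0.5, test=0.0, seed=12345):
    """Randomly split records into training, validation and test sets."""
    if abs(train + validate + test - 1.0) > 1e-9:
        raise ValueError("the three fractions must add up to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    # Cumulative cut points, so every record lands in exactly one set
    cut_train = round(len(shuffled) * train)
    cut_validate = round(len(shuffled) * (train + validate))
    return {
        "train": shuffled[:cut_train],
        "validate": shuffled[cut_train:cut_validate],
        "test": shuffled[cut_validate:],
    }


parts = partition(range(100))  # 50, 50, 0 instead of 40, 30, 30
print({name: len(rows) for name, rows in parts.items()})
```

Calling `partition(range(100), train=0.4, validate=0.3, test=0.3)` would give the 40, 30, 30 split instead.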
3. While the three screens you see when you run the StatExplore node are pretty interesting, it would be nice to have a more detailed look at your data. Just go to the VIEW menu and you can get more statistics, like the cell chi-square values, descriptive statistics of numeric variables broken down by the levels of your target variable.
After all of the effort to get Enterprise Miner installed, I thought it better do something good. It is interesting to use. Unlike programming where you can get a program to run but give you errors or unexpected results, so far (key phrase!), with Enterprise Miner I have found the problem to be knowing exactly what to select, for example, with CREATE DATA sources. Once you know that, however, it seems pretty hard to make an error.
Enterprise Miner does do some pretty cool stuff, which makes it worth the pain of getting it installed. Even way cooler, unlike back in the day when no one could get their hands on it without paying approximately $4,893,0893.16 , their first born child, their left kidney and an albino goat, if you are an instructor or a student, you can get it for free through SAS On-Demand for Academics.
(And, yes, for the record, I *am* aware that said goat is not an albino. I was fresh out of pictures of albino goats. Deal with it.)
Speaking of Enterprise Miner, I thought I would ramble on about the good parts for a few posts, since I’m getting ready to teach data mining in the fall and I hate to do anything at the last minute.
One of the good parts is StatExplore. At first glance, it looks good, but at second glance, it looks better.
All you need to do is create a diagram by going to the FILE menu, then selecting NEW and then DIAGRAM.
You can start by dragging a data source on to the diagram. In this example, I used the heart data set from the Framingham Heart Study, which happens to ship with Enterprise Miner in the SASHELP library.
I drag the data set from data sources to the diagram window.
Next, I click on the EXPLORE tab just above the diagram window. This gives you a bunch of icons. Enterprise Miner is just rife with icons. Never fear, though, if you have no idea what this bunch of colored boxes is supposed to mean versus that bunch, just hover over the icon with your mouse and it will tell you.
Here is my diagram. Simple, no? It gives you a bunch of cool stuff. First, you have the plot of chi-square values for all nominal variables.
You can see that sex has the highest chi-square (as in gender, not as in frequency of), followed by cholesterol status, smoking status and weight status. I find this rather surprising. I knew women lived longer than men, but with all of the discussion of obesity, I thought weight would be higher up there.
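For students who have not seen where a chi-square value comes from, here is a short Python sketch that computes the Pearson chi-square statistic for a two-way table of a nominal variable by the target. The counts are made up, NOT the Framingham numbers, and the function name `chi_square` is just for this example.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were unrelated
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Made-up counts: rows are the two levels of sex, columns are alive / dead
made_up_counts = [
    [30, 20],
    [20, 30],
]
print(chi_square(made_up_counts))  # 4.0, since every expected count is 25
```

The bigger the gap between the observed and expected counts, the bigger the statistic, which is what the height of each bar in the plot reflects.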
The next chart gives me the worth of each variable in predicting my target, which in this example is death.
The variable on the far left is age at start. Not surprisingly, the older people are when you start following them, the more likely they are to die in a given period of time. The next variable is Age at CHD Diagnosis, followed by two blood pressure measures, their cholesterol, then cholesterol status – weight status is down at the end.
This analysis produces A LOT of statistics. This, I found interesting because despite some people arguing Enterprise Miner allows analysis by someone without extensive programming or statistics background, certainly in the case of statistics, the more knowledge you have, the better you could make use of the results.
For example, in the top right (all three of the screen shots above are one screen, I broke them up in an attempt at legibility), the output pane gives descriptive statistics broken down by each level of the target variable. I can see how many people who died had missing data for age at CHD diagnosis, skewness and kurtosis values for variables by status, living or dead, the mode for weight status for people who were living or dead, and a whole lot more. Interestingly, 68% of the whole sample was overweight.
Scrolling through the statistics output I can get a good idea of the data quality – is it skewed, is it missing, is it missing at random.
Without some background in statistics, that’s probably no more than a bunch of numbers. Personally, I found it very helpful. That’s another assignment for the students, to write a brief summary of their data, including any concerns. There weren’t any real problems with these data except for the obvious fact that variables like cholesterol and cholesterol status, smoking and smoking status are going to be highly correlated. It would be a good idea to include one of those as input in any predictive analyses and reject the other to prevent multicollinearity problems.
(NOTE to self: Make sure to explain variable roles, changing variable roles in EM and multi-collinearity.)
You might think this is adequate for running just one node, but, in fact, there is much more here than meets the eye. More on that tomorrow because speaking of overweight, I have been at a computer for 13 hours today and I want to hop on the bike and get some exercise in before I knock out the last task I need to do today. Although @sammikes just pointed out on twitter that round is a shape, it is not the one I want to be in.
I’m putting this here for my students this fall, but I’m sure there are two or three other people in the world who would like to know how to use Enterprise Miner. I’m assuming you read some of my other posts or received an email from your professor or in other ways got Enterprise Miner installed and running.
If not, you should read the documentation. Or, you are welcome to poke around on this blog and find out what I did. Just type “miner” into the search box.
- Start Enterprise Miner
- Create a new project
- Give it a name
- Create a new library so you have some data – File > New > Library
- Type in a name and your course library, something like “/courses/yourschool.edu1/a_123/b_456”
- Create a new diagram – File > New > Diagram
- Create a data source (this strikes me as counter-intuitive, since I have the data source in the library, but whatever). Here is how you do it:
- * Right-Click on the data sources tab
- * it will come up with a drop down menu with 1 option, create data source
- * pick that
- * It will come up with this window.
- Select SAS table, which is the exact same thing as a SAS data set
- * Click Next and it will bring up the list of libraries available including the one you just added in the last step
- * Double-click to select your library
- * Select your dataset and then
- Click OK
- The next few screens give you information on your data. In my course, the first assignment is for the students to use these to answer:
- How many variables in the data set
- How many observations
- How many of these are nominal variables
- Select one of the variables that is NOT nominal. Click the explore tab.
- Write one paragraph describing these results. Include a screen shot of your results
- Click the COMPUTE SUMMARY STATISTICS tab
- Write a one paragraph summary of these results, only hitting the high(low)lights such as 98% of the data for variable v_1980 are missing.
Obviously this isn’t a feasible assignment if you have 6,000 variables, but I try to have courses that increase gradually in order of difficulty, starting with a relatively small data set and then going to gradually larger and more complex ones.
Most likely, you, too, have experienced homicidal urges when confronted with a problem you have spent five hours trying to solve on your computer, only to call tech support and have them report,
Well, it works fine on my computer.
You’d think if that solved the problem that they would offer to box up their computer and send it over to your house but, alas, they never do.
This is the reason that any software I use for class I test on several computers under different conditions. After having initially failed to get SAS On-Demand for Enterprise Miner to work with boot camp on the Mac, I tried it on a Lenovo machine running Windows 8. I had to install the JRE and ignore a few security warnings, but after that it worked.
[For how I did eventually get it working with boot camp, click here, and thank Jason Kellogg from SAS. ]
Next, I needed to upload some data. The SAS instructions say to use your favorite FTP client and coincidentally, I do have a favorite FTP client (Filezilla), so I downloaded it to the testing machine.
Only the professor can upload data to the class directory, and most professors probably have an FTP program on their personal computer (or maybe not, do you?) Even if you normally do, you may, like me, have borrowed a machine to use for testing or have a new computer. Whatever, this just reinforces my argument that you should never, never plan to use any kind of software in a class unless you have ample time to prepare.
I know that there are schools that ask adjuncts to teach on a week or two notice. That seems to me a recipe for disaster for both the professor and students. Unless maybe you are doing something that hasn’t changed in 50 years and requires no technology, like reading Chaucer, I recommend you follow the advice of Nancy Reagan and “Just say no.”
Here are my first few hints:
- Test the software on multiple machines and multiple operating systems.
- Make sure one of those machines is on the older, under-powered end of the spectrum, as students often don’t have a lot of extra cash and may not have the shiniest, newest machine like you have on your desk.
- Test it on the latest operating system. It may turn out that the version your school has does not work with Windows 8. (I did not have that problem with Enterprise Miner this time, but I’ve had it with other software in the past, so it is a good idea.)
- Find out what other software you might need, for example, some kind of FTP program in this case, and install it on your computer, if necessary.
- Give yourself plenty of time to do all of the above.
You might think these types of things would be handled by the information technology department at your university, and you may be really lucky and that will be so. In many schools, the IT department basically helps re-set passwords, assigns school email addresses, helps to get discounts on software and upload files to Blackboard and not much else.
For years, I have been trying to figure out where the $50,000 a year or so tuition goes. It isn’t to adjunct professors and it isn’t to the IT staff. It also isn’t to buying the latest technology because, more and more often, students are expected to bring their own device.
You may think that none of the above should be your job and you may be right, but I am just saying if you want to anticipate the frustrations your students will experience and be able to solve their problems during the lecture by directing them to a link on your class website/ blog your life and theirs will both be a lot easier.
A while ago, I posted about Women in Tech, the double standard where women have to be twice as outstanding to be a keynote speaker, for example. The past year, I’ve been really cutting down on travel, for example, I didn’t go to either SAS Global Forum or the Joint Statistical Meetings, because I’m focusing on 7 Generation Games, which is growing fast.
Then, Frank and Ethan contacted me and said,
Hey, we need a keynote speaker for the Western Users of SAS Software conference. Are you busy?
Some discussion ensued during which they elicited a binding oath not to swear, threaten or otherwise defame any individual or company during the presentation and they promised me travel expenses, an unreasonable quantity of the adult beverage of my choice and a box of cookies.
This was definitely an offer I could refuse. I do have an MBA, after all, and the compensation does not exactly rival my normal hourly rate, in fact, it doesn’t beat the hourly rate of the young person who made this coffee I’m drinking.
Still, after ranting (more than once) about how women are not visible in Silicon Valley, I felt too much of a hypocrite to turn down the opportunity to be the keynote speaker at a software conference in Silicon Valley-adjacent San Jose (cue all my friends who graduated from San Jose State insisting it is indeed Silicon Valley, to which I reply, “Ha!”).
The presentation is
“LEAN IN” WITH SAS
A major reason for learning SAS (and why I teach it to students) is that it can prepare one to do something else. SAS can be a great gateway drug for other programming languages and a career as a developer. Too many people are hesitant to take that next step. Why?
See, you always thought I just made stuff up as I went along, but no, I actually have an entire title and four sentences six months in advance. (Why? is a complete sentence as decided by me, the grammar supreme court of this blog.)
Now, I have to go read that book, Lean In, for two reasons:
- If I’m going to reference it in the title, I probably should have read it.
- My initial reaction to having to read it was, “Oh great, another book on success by some privileged idiot who was born on third-base, thinks she hit a triple and now is lecturing the rest of us on how to get home runs. ” It occurred to me that my reaction was solely based on what I knew about the author. However, I was raised with the belief that all prejudice is wrong and that includes bias against rich, white people as well as against poor, black people. As penance, I am now going to go read the book. If it truly does suck, I will let you know. I hope you all appreciate this.
Frank and Ethan, I also want you to note that I did not swear in this post, not even once.
You’re fucking welcome.
Thanks to Jason Kellogg from SAS Technical Support, SAS On-Demand Enterprise Miner is now running on my Mac using Windows 8.1 with boot camp. Here were his instructions.
The steps are:
1. Download and save jre-6u24-windows-i586.exe. http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html#jre-6u24-oth-JPR
2. Open the Windows Run window and run "C:\users\[userid]\Downloads\jre-6u24-windows-i586.exe" STATIC=1 where [userid] is your user account name
3. Click OK to start the installation
4. After finishing the installation, on the desktop, right click empty area and select “Create Shortcut” (NOTE: on Windows 8.1 this was NEW and then SHORTCUT)
5. In the location, Browse to Desktop and click Next
6. In the next screen provide name of shortcut, for example “Enterprise MinerJWS”
7. Once the shortcut is created, Right Click and select Properties. In the Target enter the following: "C:\Program Files (x86)\Java\jre1.6.0_24\bin\javaws.exe" https://academic93.oda.sas.com/SASEnterpriseMinerJWS/main.jnlp
8. Click Apply
You now have a clickable shortcut to Enterprise Miner. Please use it when starting Enterprise Miner.
This worked and I now have SAS Enterprise Miner working on my laptop, which is going to be extremely convenient.
PLEASE NOTE THAT ALL OF THE QUOTATION MARKS NEED TO BE THERE OR IT WILL GIVE YOU AN ERROR.
ALSO, under #7 that is all one command. I had to break into two lines on this blog to be legible.
Although it was still a huge pain in the ass to get started, it is leaps and bounds ahead of the first time I tried Enterprise Miner years ago.
Back then, it required back flips and sacrificing a chicken (okay, finding a machine running Windows XP, installing a bunch of files – just take my word it was a pain in the ass). As for the on-demand version, it was so slow as to be useless.
In contrast, once I got up and running, it was not bad at all, and that was running off the wireless in the office. Now, our internet speed is good here, so your mileage may vary, but at least under good conditions it runs fine using a small dataset.
So, I just uploaded a dataset with 10,000 records and 6,000 variables. We’ll see what it does with that.
==== Random shameless plug =====
When I’m not playing around with statistical software, I’m running a company that makes adventure games to teach math. If you want your children to do something educational this summer, you can buy a copy here for $9.99.
A few years ago, when I was at USC, I tried to get a desktop version of Enterprise Miner to run on a virtual machine on my Mac and that never happened, although I did get it working on a Windows machine I had at home.
Since then, I have successfully installed Enterprise Miner and started it using a Windows native machine.
Sadly, the same cannot be said for my Macs. Using boot camp on two different Macs, one running Windows 7 and another with Windows 8.1 I have had the same problems.
Be aware that if you are going to run Enterprise Miner on any operating system you are going to need at least some idea of what a C: prompt is and feel comfortable poking around things like .dll files.
You might think that this can be assumed and goes without saying if you are teaching, or even taking, a course in data mining. You would be wrong. Nothing can be assumed or goes without saying. Trust me on this.
I am not going to assume that you checked your configuration and the appropriate Java Runtime Environment is installed. If that is not the case, or you are not sure, go here and take care of that now. (See how this not assuming thing works?)
If that is taken care of, regardless of operating system, you will probably have a problem with Java security blocking the application from starting. For me, changing the Java security setting to medium fixed that on all 3 machines. I tried several other things that did NOT fix it. To find your Java security settings, you can go to the control panel (in Windows 8, search for control panel first) and then search for Java within the control panel. Click on Java, then the security tab to find the slider to move to medium.
At this point, the Windows machine worked, even though I had to click on several boxes where Java asked me was I ***SURE*** I wanted to do this.
With the Mac though, after I click on Start SAS On Demand Software, Enterprise Miner, it downloads a main.jnlp file and, when I open it, I eventually get a message that an error exists in the user services configuration. You can see screenshots here. The same exact problems occurred with both Mac computers running boot camp.
The ever-helpful Rebecca Ottesen said that two of her students using Macs last semester had the same problem and sent me an email directing me to this site.
So, I did a PROC OPTIONS in SAS, which I had loaded on my desktop and verified that the .dll file was located where expected
— and this led me to thinking, wait a minute, my students aren’t going to have SAS loaded on their computers so what are THEY going to do to troubleshoot.
That was kind of a moot point, though, because …
When I got to step 3, I typed in the command as directed, in the exact directory directed:
C:\Program Files (x86)\Java\jre1.6.0_24>java -fullversion
I get the error message ‘java’ is not recognized as an internal or external command, operable program or batch file.
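One thing worth checking when you see that message is whether the java program is actually somewhere Windows can find it. Here is a small Python sketch that looks in the bin folder under an install directory and then on the PATH. It is only a troubleshooting aid under my own assumptions: the function name `find_java` is made up, it assumes you have Python on the machine, and I am not claiming a missing PATH entry was the cause of the error above.

```python
import os
import shutil


def find_java(install_dir=None):
    """Return a path to the java program, or None if it cannot be found.

    Looks in the bin folder under install_dir first, then on the PATH.
    """
    if install_dir:
        for name in ("java.exe", "java"):
            candidate = os.path.join(install_dir, "bin", name)
            if os.path.isfile(candidate):
                return candidate
    return shutil.which("java")  # None when java is not on the PATH


# Install directory taken from the prompt above; adjust it for your machine
location = find_java(r"C:\Program Files (x86)\Java\jre1.6.0_24")
print(location or "java was not found under bin or on the PATH")
```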
Now, there could be any number of other things to try but the fact is, I have other things to do and the course is not for a few months. I will keep plugging away and keep you abreast here. If I do decide to go with Enterprise Miner in the fall, I am sure these posts will be helpful references for students.
I do want to advise anyone who is thinking about using the on-demand version of Enterprise Miner to be aware that you are definitely going to have at least a few problems with getting it installed, for example, the security thing, and if you have any students using boot camp, they are going to most likely hate you.