So I say,
“There may be some useful information in the text fields. For example, people who use credit to buy commodities such as meat may differ from people who buy finished goods like clothing, who may in turn differ from people who purchase machinery on credit. Perhaps you may want to consider some kind of clustering.”
The very bright young people nod and one says quite brightly to another,
“Well, you better get out your grep statements.”
All of the rest nod in pleased agreement, while I say to my old and faded self,
Grep? That’s what you’ve got? Seriously? What. The. Fuck.
Okay, ten points to the bright young people for knowing any Unix commands, or even that Unix exists, which puts them ahead of a lot of people. However, please explain to me why in the name of God you would not even consider using something like Statistica or SAS Enterprise Miner? There is an R text mining package. I have never used it because I don’t use R (long discussion of that here), but these young people had spent three semesters learning to program in R and did not even know it existed.
At one point I was playing around with both Ruby and SAS to write a program to parse text. Do you have any idea how much of a pain in the ass that is? In that case, because it was on a set of data with a VERY limited scope, we could do it by using just a few hundred words. It was a small project with a very small budget and at the time I was wanting a project that gave me an excuse to learn Ruby.
For a more general project with the whole English language as its scope, that would be an insane undertaking. It would cost the client several times the cost of a SAS or Statistica license (not sure about the SPSS offering), and what I could write would not come within shouting distance of something a team of people had worked on for years, in either quality or comprehensiveness.
The most recent client who asked me this actually has a SAS Enterprise Miner license at their organization! (So, yes, while the license fee is humongous, since it had already been paid, the additional cost to use it on this project would be zero.)
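To make the clustering suggestion concrete, here is a toy sketch of the idea: group purchase descriptions by the words they share, rather than grepping for keywords one at a time. Everything in it is an assumption for illustration, including the sample descriptions, the stopword list, the 0.3 similarity threshold and the function names. It is written in plain Python only so that it is self-contained; it is nowhere near what a real text mining package does, which is rather the point of this post.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real tool ships a far more complete one.
STOPWORDS = {"and", "of", "the", "a", "for"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster_descriptions(descriptions, threshold=0.3):
    """Greedy single-pass clustering: each description joins the most similar
    existing cluster if the similarity reaches the threshold, otherwise it
    starts a new cluster. Returns one cluster label per description."""
    centroids = []  # summed word counts for each cluster
    labels = []
    for text in descriptions:
        vec = Counter(tokenize(text))
        best, best_sim = None, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
        else:
            centroids[best].update(vec)
            labels.append(best)
    return labels

# Made-up text fields, not data from any client.
purchases = [
    "frozen beef and pork",
    "beef ribs and pork chops",
    "cotton shirts and wool coats",
    "wool coats and denim jackets",
    "industrial lathe and drill press",
    "drill press and hydraulic lathe",
]

labels = cluster_descriptions(purchases)
print(labels)  # → [0, 0, 1, 1, 2, 2]
```

On six tidy sentences this separates meat, clothing and machinery. On real text fields, with misspellings, abbreviations and thousands of distinct words, it would fall apart quickly, and that is the gap an existing text mining tool fills.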
I started on it and after about ten minutes of reflection realized there were probably dozens of jQuery plug-ins that did slide shows of every size, shape and form. Sure enough, 5 seconds on Google gave me a couple to download.
I downloaded one, modified it a bit and it was okay, though I’m not sure it is exactly what I want either. When I looked at the code in detail, it was evident the author had done the same as me, downloaded someone else’s code and modified it, because there were entire directories in there that did nothing. So, I deleted those.
After playing with that for a while, I thought perhaps there were other, better, slideshow plug-ins available. I downloaded another one because, even though I knew it probably wouldn’t suit my purpose, it was written so much more succinctly, I found it interesting.
So ... two lessons:
- Don’t waste your time creating something that has already been created.
- Even if you do want to create something, whether just for the hell of it or as a learning experience, you’ll probably learn a lot, and end up with a better product faster, if you build on what other people have already done rather than starting with a blank page and Notepad++.