Do Not Re-Invent the Software Wheel unless you’re into wheel-inventing

So I say,

“There may be some useful information in the text fields. For example, people who use credit to buy commodities, such as meat, may differ from people who buy finished goods like clothing, who in turn may differ from people who purchase machinery on credit. Perhaps you may want to consider some kind of clustering.”

The very bright young people nod and one says quite brightly to another,

“Well, you better get out your grep statements.”

All of the rest nod in pleased agreement, while I say to my old and faded self,

Grep? That’s what you’ve got? Seriously? What. The. Fuck.

Okay, ten points for the bright young people for knowing any Unix commands or even that Unix exists, which puts them ahead of a lot of people. However, please explain to me why in the name of God you would not even consider using something like Statistica or SAS Enterprise Miner? There is an R text mining package. I have never used it because I don’t use R (long discussion of that here) but these young people had spent three semesters learning to program in R and did not even know the package existed.

At one point I was playing around with both Ruby and SAS to write a program to parse text. Do you have any idea how much of a pain in the ass that is? In that case, because it was on a set of data with a VERY limited scope, we could do it by using just a few hundred words. It was a small project with a very small budget, and at the time I wanted a project that gave me an excuse to learn Ruby.
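
For a sense of what that kind of limited-scope parsing amounts to, here is a minimal sketch in Python (not the Ruby or SAS code from that project). The vocabulary, the category names and the `categorize` function are all hypothetical, made up for illustration; a real word list would run to a few hundred hand-picked entries.

```python
import re
from collections import Counter

# Hypothetical hand-built vocabulary mapping keywords to purchase categories.
# Illustration only: a real list would hold a few hundred words.
VOCABULARY = {
    "beef": "commodities",
    "pork": "commodities",
    "grain": "commodities",
    "shirts": "finished goods",
    "shoes": "finished goods",
    "tractor": "machinery",
    "lathe": "machinery",
}


def categorize(text):
    """Return the category whose keywords appear most often in text, or None."""
    # Lowercase the text and split it into alphabetic words.
    words = re.findall(r"[a-z]+", text.lower())
    # Count category hits for every word that is in the vocabulary.
    counts = Counter(VOCABULARY[w] for w in words if w in VOCABULARY)
    if not counts:
        return None  # nothing in the text matched the word list
    return counts.most_common(1)[0][0]
```

Something like this can work on a narrow data set. Nothing in it handles misspellings, synonyms or any word outside the list, and that gap grows quickly as the scope widens.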

For a more general project with the whole English language as its scope, that would be an insane undertaking. It would cost the client several times the cost of buying a SAS license or Statistica (not sure about the SPSS offering), and what I could write would not come within shouting distance of something a team of people had worked on for years, in either quality or comprehensiveness.

The most recent client who asked me this actually has a SAS Enterprise Miner license at their organization! (So, yes, while the license fee is humongous, since it had already been paid, the additional cost to use it on this project would be zero.)

While this is the latest, most outrageous example, the “Do-It-Yourself” fallacy happens all the time. Recently, I needed to do a slideshow. I thought I would write it using JavaScript/jQuery because that is something else I have been wanting to learn better and the Codecademy thing just didn’t do it for me. I wanted an actual project.

I started on it and after about ten minutes of reflection realized there were probably dozens of jQuery plug-ins that did slide shows of every size, shape and form. Sure enough, 5 seconds on Google gave me a couple to download.

I downloaded one, modified it a bit and it was okay, though I’m not sure it is exactly what I want either. When I looked at the code in detail, it was evident the author had done the same as me, downloaded someone else’s code and modified it, because there were entire directories in there that did nothing. So, I deleted those.

After playing with that for a while, I thought perhaps there were other, better, slideshow plug-ins available. I downloaded another one because, even though I knew it probably wouldn’t suit my purpose, it was written so much more succinctly, I found it interesting.

So … two lessons:

  1. Don’t waste your time creating something that has already been created.
  2. Even if you do want to create something, either just for the hell of it or as a learning experience, you’ll probably learn a lot and end up with a better product, faster, if you build on what other people have already done rather than start with a blank page and Notepad++.


  1. Well, without detracting too much from your major point, grep is really, really, really fast. Also if you pipe the output into awk or sed you can get some awesome data-manipulation work done pretty quickly.

    R also has a grep function, but as it’s based on PCRE (Perl Compatible Regular Expressions), it can be somewhat slower than pure grep.

    Seriously, if you have a crummy computer or a really big dataset, grep is very useful.

    That being said, the tm or RTextTools packages are the way to go for anything serious (if you’re using R).

    Incidentally, if any R users are reading this and you want to know if R can do something, try findFn from the sos package:
    findFn("whatever you want to do")

    This is an absolute life saver, but for some reason people don’t know about it.

  2. I’m with disgruntledphd: the combination of grep, awk, and sed is incredible for *some* applications. Their main advantages are speed and piping; if you’re going to do some processing in the pipeline as well, the Unix terminal is a godsend. If all you want is to parse some text for keywords … they’re overkill, obviously.

  3. Not saying that grep is a bad thing, and, as I said, props to them for knowing that Unix even exists in a world where most people are puzzled by a C:\ prompt.

    “What’s a ‘see prompt’? Where do I see it? I don’t see anything.”

    My point, though, is that we sometimes start by writing a program to solve a problem for which a solution could be downloaded in 30 seconds.

  4. > why in the name of God [would you]
    > not even consider using something
    > like Statistica or SAS Enterprise Miner

    I would’ve expressed my surprise with the same intensity, but with the clauses reversed.

    Certainly if you have Statistica or SAS, _and_ you’ve spent the time to learn (a) what tools come in the toolset, and (b) how to use those tools, you should proceed with haste.

    If you haven’t, and are faced with the choice of having to learn grep/sed/awk, or RTextTools, or SPSS, I know which one I’d advise.

  5. Adele –
    I think that depends on the size of your data set and your budget. Certainly, a SAS license is not for the faint of budget – although it is free for students registered in courses using it.

    One of my personal reasons for doing more programming in other languages these days is that the SAS cost is pretty extreme and I don’t want to be tied to it.

    Seriously, though, writing your own program to parse, filter and cluster text is a MASSIVE task.

    Sure, if you have a few hundred records and a limited content area, you can cobble something together in not too much time, maybe even with a few thousand records. With a million? Not a chance. Simply typing in all the words in the dictionary will take you weeks.

    This is one of the reasons that a good question to ask in job interviews is “What is the largest data set you have worked on?”

    Solutions that are fine for smaller data sets are just not appropriate for larger ones.

    The reverse is also true. I’ve seen people use a star schema for a project with 300 records because “that’s the way it’s supposed to be done.”

    From the perspective of someone in a business, my thought is I want to hire people who DO know what tools are out there and can use them. I don’t want someone who sees every task as a nail because all they have is a hammer.

  6. Good for your old-fashioned self! Building from scratch doesn’t just waste valuable programming time, it also has a significant opportunity cost – time elapsed while building a solution you could have bought is time that you don’t have the results to use for the business. You are also likely to end up with less functionality, more errors, poor documentation and a greater burden of maintenance.

    I hear the comments about cost, and that’s a real concern. Still, everyone should be careful to evaluate the true cost of new development. Buying is usually cheaper than building. And there are a variety of alternatives for most types of analytics, so there should be no building without a serious exploration of the alternatives.

  7. Most analyses do not need customized software, so using an off-the-shelf tool is the most cost effective way to solve the problem. If you can’t afford Statistica, buy JMP. If you can’t afford JMP, use KNIME. But don’t start with MLC++ or Java code from WEKA or going to 2 weeks of R bootcamp.

    I’m definitely a “use software that’s been written already” kind of guy now. And know that I’ve gone the other route and have written lots of C/C++/FORTRAN/Numerical Recipes/SQL/Matlab/Mathematica/regex/grep/awk/sed/bash/tcsh/S/etc etc etc code to do basic and customized processing. I get that. But the time it takes to learn a good data mining software tool is dwarfed by the time it takes to become proficient in a programming language or, even if you know the language well, the time it takes to create and debug code. (Did someone say QR factorization?)

    The key is to find a software product with good building blocks so you have sufficient flexibility to accomplish the 90% solution in 10% of the time it would take to build it from scratch. Get the win, then if you really need that last increment of performance improvement, you can justify the time it takes to get there with the value you have already brought to the table.

  8. Stefan –
    Building from scratch IS fun and that’s why I said “unless you’re into wheel inventing”. In my post after this one I mentioned building something from scratch “Just because”. The just because in my case was both because I thought it was fun and because I wanted to learn the language better.

    If those are your reasons, fine. If your reason is that you don’t know those other tools exist, or you don’t want to spend the money to buy them, then, as others have said here, you are probably going in the wrong direction.

    As Dean said, those customized solutions are often best begun from a base where 90% of the code is already written.

    Our clients are billed by the hour. I can’t justify billing someone for 100 hours instead of 20 because “The solution was more fun for me to write this way and I learned a lot doing it.”

  9. I think it’s great fun too. I had great fun with a customer recently building a text mining system from regular expressions (I know most of you won’t find regular expressions “fun”…).

    I just don’t have enough time to do it anymore. I’m with you completely on this, AnnMaria: I can’t justify custom tweaks when the actual improvement due to the tweaks isn’t significant.
