Do Not Re-Invent the Software Wheel unless you’re into wheel-inventing

By AnnMaria De Mars, May 7, 2012 (updated May 8, 2012)

So I say,

“There may be some useful information in the text fields. For example, people who use credit to buy commodities, such as meat, may differ from people who buy finished goods like clothing, who in turn may differ from people who purchase machinery on credit. Perhaps you may want to consider some kind of clustering.”

The very bright young people nod and one says quite brightly to another,

“Well, you better get out your grep statements.”

All of the rest nod in pleased agreement, while I say to my old and faded self,

Grep? That’s what you’ve got? Seriously? What. The. Fuck.

Okay, ten points for the bright young people for knowing any Unix commands or even that Unix exists, which puts them ahead of a lot of people. However, please explain to me why in the name of God you would not even consider using something like Statistica or SAS Enterprise Miner? There is an R text mining package. I have never used it because I don’t use R (long discussion of that here) but these young people had spent three semesters learning to program in R and did not even know it existed.

At one point I was playing around with both Ruby and SAS to write a program to parse text. Do you have any idea how much of a pain in the ass that is? In that case, because it was on a set of data with a VERY limited scope, we could do it by using just a few hundred words. It was a small project with a very small budget and at the time I was wanting a project that gave me an excuse to learn Ruby.
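For a sense of what that limited-scope approach looks like, here is a minimal sketch in Python (not the Ruby or SAS code mentioned above, which is not shown in this post). The categories, keyword lists, and the `categorize` function are all hypothetical, chosen only to echo the credit-purchase example; a real project of this kind would carry a few hundred hand-picked words.

```python
import re
from collections import Counter

# Hypothetical keyword lists, for illustration only.
CATEGORY_KEYWORDS = {
    "commodity": {"meat", "beef", "pork", "grain", "flour"},
    "finished_goods": {"clothing", "shirt", "shoes", "dress", "coat"},
    "machinery": {"machinery", "tractor", "lathe", "engine", "press"},
}


def categorize(text):
    """Return the category whose keywords occur most often, or 'unknown'."""
    # Lowercase the text and split it into alphabetic words.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for category, keywords in CATEGORY_KEYWORDS.items():
        counts[category] = sum(1 for word in words if word in keywords)
    best, hits = counts.most_common(1)[0]
    return best if hits > 0 else "unknown"


# Made-up free-text purchase descriptions.
records = [
    "50 lbs beef and pork on credit",
    "Two shirts, one coat",
    "Steam engine for the mill",
    "Misc. items",
]
labels = [categorize(record) for record in records]
```

Note that "shirts" is not matched because only "shirt" is in the list; the second record is classified only because "coat" happens to be there. Plurals, misspellings and synonyms all have to be typed in by hand, which is exactly why this only works for a very limited scope.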

For a more general project with the whole English language as its scope, that would be an insane undertaking. It would cost the client several times the cost of buying a SAS or Statistica license (not sure about the SPSS offering) – and what I could write would not come within shouting distance of being as well done and comprehensive as something a team of people had worked on for years.

The most recent client who asked me this actually has a SAS Enterprise Miner license at their organization! (So, yes, while the license fee is humongous, since it had already been paid, the additional cost to use it on this project would be zero.)

While this is the latest, most outrageous example, the “Do-It-Yourself” fallacy happens all the time. Recently, I needed to do a slideshow. I thought I would write it using JavaScript/jQuery because that is something else I have been wanting to learn better and the Codecademy thing just didn’t do it for me. I wanted an actual project.

I started on it and after about ten minutes of reflection realized there were probably dozens of jQuery plug-ins that did slide shows of every size, shape and form. Sure enough, 5 seconds on Google gave me a couple to download.

I downloaded one, modified it a bit and it was okay, though I’m not sure it is exactly what I want either. When I looked at the code in detail, it was evident the author had done the same as me, downloaded someone else’s code and modified it, because there were entire directories in there that did nothing. So, I deleted those.

After playing with that for a while, I thought perhaps there were other, better, slideshow plug-ins available. I downloaded another one because, even though I knew it probably wouldn’t suit my purpose, it was written so much more succinctly, I found it interesting.

So …. two lessons

Don’t waste your time creating something that has already been created.
Even if you do want to create something, either just for the hell of it or as a learning experience, you’ll probably learn a lot and end up with a better product, faster, if you build on what other people have already done rather than start with a blank page and Notepad++.

11 Comments

disgruntledphd says:

May 7, 2012 at 4:53 am

Well, without detracting too much from your major point, grep is really, really, really fast. Also if you pipe the output into awk or sed you can get some awesome data-manipulation work done pretty quickly.

R also has a grep function, but as it’s based on PCRE (Perl Compatible Regular Expressions), it can be somewhat slower than pure grep.

Seriously, if you have a crummy computer or a really big dataset, grep is very useful.

That being said, the tm or RTextTools packages are the way to go for anything serious (if you’re using R).

Incidentally, if any R users are reading this and you want to know if R can do something:
install.packages("sos")
require(sos)
findFn("whatever you want to do")

This is an absolute life saver, but for some reason people don’t know about it.
Wesley says:

May 7, 2012 at 9:05 am

I’m with disgruntledphd: the combination of grep, awk, and sed is incredible for *some* applications. Their main advantages are speed and piping; if you’re going to do some in the pipeline as well, the Unix terminal is a godsend. If all you want is to parse some text for keywords … they’re overkill, obviously.
AnnMaria says:

May 8, 2012 at 4:03 pm

Not saying that grep is a bad thing, and, as I said, props to them for knowing that Unix even exists in a world where most people are puzzled by a C:\

“What’s a ‘see prompt’? Where do I see it? I don’t see anything.”

My point, though, is that we sometimes start by writing a program to solve a problem for which a solution could be downloaded in 30 seconds.
Adele Horford says:

May 8, 2012 at 9:46 pm

> why in the name of God [would you]
> not even consider using something
> like Statistica or SAS Enterprise Miner

I would’ve expressed my surprise with the same intensity but having reversed the clauses.

Certainly if you have Statistica or SAS, _and_ you’ve spent the time to learn (a) what tools come in the toolset, and (b) how to use those tools, you should proceed with haste.

If you haven’t, and are faced with the choice of having to learn grep/sed/awk, or RTextTools, or SPSS, I know which one I’d advise.
AnnMaria says:

May 8, 2012 at 11:26 pm

Adele –
I think that depends on the size of your data set and your budget. Certainly, a SAS license is not for the faint of budget – although it is free for students registered in courses using it.

One of my personal reasons for doing more programming in other languages these days is that the SAS cost is pretty extreme and I don’t want to be tied to it.

Seriously, though, writing your own program to parse, filter and cluster text is a MASSIVE task.

Sure, if you have a few hundred records and a limited content area, you can cobble something together in not too much time, maybe even with a few thousand records. With a million? Not a chance. Simply typing in all the words in the dictionary will take you weeks.

This is one of the reasons that a good question to ask in job interviews is “What is the largest data set you have worked on?”

Solutions that are fine for smaller data sets are just not appropriate for larger ones.

The reverse is also true. I’ve also seen people use a star schema for a project with 300 records because “that’s the way it’s supposed to be done.”

From the perspective of someone in a business, my thought is I want to hire people who DO know what tools are out there and can use them. I don’t want someone who sees every task as a nail because all they have is a hammer.
Meta Brown says:

May 9, 2012 at 10:32 am

Good for your old-fashioned self! Building from scratch doesn’t just waste valuable programming time, it also has a significant opportunity cost – time elapsed while building a solution you could have bought is time that you don’t have the results to use for the business. You are also likely to end up with less functionality, more errors, poor documentation and a greater burden of maintenance.

I hear the comments about cost, and that’s a real concern. Still, everyone should be careful to evaluate the true cost of new development. Buying is usually cheaper than building. And there are a variety of alternatives for most types of analytics, so there should be no building without a serious exploration of the alternatives.
Dean Abbott says:

May 9, 2012 at 10:57 am

Most analyses do not need customized software, so using an off-the-shelf tool is the most cost effective way to solve the problem. If you can’t afford Statistica, buy JMP. If you can’t afford JMP, use KNIME. But don’t start with MLC++ or Java code from WEKA or going to 2 weeks of R bootcamp.

I’m definitely a “use software that’s been written already” kind of guy now. And know that I’ve done the other route and have written lots of C/C++/FORTRAN/Numerical Recipes/SQL/Matlab/Mathematica/regex/grep/awk/sed/bash/tcsh/S/etc etc etc code to do basic and customized processing. I get that. But the time it takes to learn a good data mining software tool is dwarfed by the time it takes to become proficient in a programming language, or even if you know the language well, the time it takes to create and debug code. (did someone say QR factorization?)

The key is to find a software product with good building blocks so you have sufficient flexibility to accomplish the 90% solution in 10% of the time it would take to build it from scratch. Get the win, then if you really need that last increment of performance improvement, you can justify the time it takes to get there with the value you have already brought to the table.
Stefan Reich says:

May 9, 2012 at 1:31 pm

God. Building from scratch is also one of the most enjoyable and important activities.

What a stupid column.
AnnMaria says:

May 9, 2012 at 1:51 pm

Stefan –
Building from scratch IS fun and that’s why I said “unless you’re into wheel inventing”. In my post after this one I mentioned building something from scratch “Just because”. The just because in my case was both because I thought it was fun and because I wanted to learn the language better.

If those are your reasons, fine. If your reason is because you don’t know those other tools exist, or you don’t want to spend the money to buy them, then, as others have said here, you are probably going in the wrong direction.

As Dean said, those customized solutions are often best begun from a base where 90% of the code is already written.

Our clients are billed by the hour. I can’t justify billing someone for 100 hours instead of 20 because “The solution was more fun for me to write this way and I learned a lot doing it.”
Dean Abbott says:

June 2, 2012 at 12:12 pm

I think it’s great fun too. I had great fun with a customer recently building a text mining system from regular expressions (I know most of you won’t find regular expressions “fun”…).

I just don’t have enough time to do it anymore. I’m with you completely on this AnnMaria: I can’t justify custom tweaks when the actual improvement due to the tweaks isn’t significant.
