As graduation nears, I look at my computer desktop and realize that most of the academic software licenses will expire before I start my next graduate program. For a data scientist, this is serious. How am I going to cross-tab survey results without SPSS? Am I going to have to do my stepwise regressions manually now? How am I going to create presentation-quality geographic displays without Tableau? What package am I going to use for linear algebra? I freak out pretty badly, until I remember that just about every application you could want for data analysis is available for free under an open source license, often the GPL, that never expires.
Some of the software I use is already open source. Data scientists’ two favorite programming languages, Python and R, are open source. The same goes for Linux, the operating system that runs on five of my seven computers. Many of the applications I use, though, are commercial, and I have them only because the company or my university gives me a free license. What follows is a quick survey of open source alternatives for the most commonly used software.
Spreadsheet
Spreadsheet applications are to the data analyst what a table saw is to a woodworker: the big tool in the middle of the shop that gets used somehow in nearly every project. For many of us, Excel is the first spreadsheet we learn, probably because it is standard equipment on most office and university computers.
Excel is the Chrysler New Yorker of spreadsheet applications: huge and comfortable but not too nimble, loaded with lots of features that are nice to have but that you don’t really need. Then again, some features, like the way Excel handles data tables, advanced filtering, and pivot tables, can save a lot of time. Even the conditional formatting is nice to have. Plus, if you need to interface with business major types, Excel will be the only spreadsheet they’ve ever heard of.
Excel has plenty of drawbacks too, though. It is a huge program. It only runs on Windows (or OS X, if you don’t need to run any add-ins). The only natively supported scripting language is VBA. Perhaps worst of all, and unforgivably, it’s slow. If you haven’t noticed this for yourself, go try to do some sensitivity analysis on a simulation with 10,000 or more trials. Expect to have time for two or three cups of coffee every time you press F9.
Gnumeric is a totally different take on the concept of spreadsheets. When I first experimented with it about 15 years ago, I concluded that it was too limited to be useful. Since then, however, the project has reinvented itself as the lightweight, stripped-down spreadsheet for data analysis. If Excel is a Chrysler New Yorker, then Gnumeric is a Dodge Dart.
In recent years, an SPSS-style “statistics” menu has appeared in the Gnumeric interface. Now the most-used features of a spreadsheet and a statistics package are within easy reach, which will appeal to anyone who has ever spent a morning clicking back and forth between Excel and SPSS while analyzing a data set.
By far the most appealing feature of Gnumeric is that it uses Python as one of its scripting languages. This means that not only is it painless to create user functions but, given the multitude of libraries available for Python, you probably won’t need to very often. Excel’s non-linear solver seems pretty rudimentary when you have access to scipy.optimize. Also, since Gnumeric allows Python and C plug-ins, it is useful as a graphical front end to more complicated programs written in these languages. Pretty cool, especially considering the whole application is still lightweight enough to run on a $100 garage sale computer.
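To give a feel for what scipy.optimize brings to the table, here is a minimal standalone sketch (plain Python, not Gnumeric plug-in code) that fits an exponential decay curve by non-linear least squares, the kind of job people often hand to a spreadsheet solver. The data points and the `decay` model are made up for illustration; `curve_fit` is one of several routines in the library.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(x, a, b):
    """Hypothetical model for illustration: y = a * exp(-b * x)."""
    return a * np.exp(-b * x)

# Synthetic, noise-free observations generated from a = 5, b = 0.7.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = decay(x, 5.0, 0.7)

# Start from a rough guess and let the solver refine the parameters.
params, covariance = curve_fit(decay, x, y, p0=[1.0, 0.1])
a_fit, b_fit = params
print(f"a = {a_fit:.3f}, b = {b_fit:.3f}")
```

With real, noisy data the fitted values will only approximate the underlying parameters, and a poor starting guess can keep the solver from converging, so treat this as a starting point rather than a recipe.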
Other possibilities: LibreOffice Calc (part of LibreOffice, a fork of OpenOffice) is also free and seems to be designed as a more direct replacement for Excel. If you need a more general-purpose spreadsheet, it might be a good choice.
Statistics Package
When all you need to do is calculate a few confidence intervals or run a t-test, a spreadsheet application will probably be adequate. If you are creating complex statistical models, you are probably going to write them in R. Between these extremes, about 90% of statistics work gets done in a statistics package. Which one you prefer probably depends on which one your college statistics professor used. The thing they all have in common, however, is that a full license costs a small fortune. Usually, the coolest add-on packages (for simulation, predictive analytics, etc.) are even more of a buy-up. Luckily, for those of us of modest means, there is PSPP.
PSPP is intended as a direct clone of SPSS, but it is GPL licensed, so it is completely free to use. One important difference from the other tools in this survey is that PSPP is written in C rather than Python, so for once the theme does not hold. If you want to extend it, expect to work in C or in its SPSS-compatible syntax language rather than writing Python add-ins. On the plus side, PSPP runs on just about any mainstream platform, including Linux, Windows, and macOS.
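For the simple end of that range, a confidence interval and a t-test take only a few lines of Python with scipy.stats. This is a minimal sketch with made-up scores for two hypothetical groups. It uses Welch’s version of the t-test, which is my choice for the example and not something a statistics package would necessarily default to.

```python
import numpy as np
from scipy import stats

# Made-up scores for two hypothetical survey groups.
group_a = np.array([72.0, 85.0, 78.0, 90.0, 66.0, 81.0, 77.0, 88.0])
group_b = np.array([65.0, 70.0, 74.0, 68.0, 80.0, 62.0, 71.0, 69.0])

# 95% confidence interval for the mean of group A, using the t distribution.
mean_a = group_a.mean()
sem_a = stats.sem(group_a)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(group_a) - 1, loc=mean_a, scale=sem_a)

# Welch's two-sample t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"mean A = {mean_a:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```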
Linear Algebra System
When you start building serious matrix-heavy models, such as anything involving Markov chains or finite element analysis, you are going to want to seriously think about using a language that is built for linear algebra. Sure, Python has good linear algebra support through <a href="http://numpy.org">numpy</a> and other libraries. But Python is basically a general-purpose, list-based language. Linear algebra looks ugly in Python, and ugly code takes longer to write and is harder to debug. It’s better to use the right tool for the job.
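To make the trade-off concrete, here is a minimal sketch of a small Markov chain in numpy, so you can judge the readability for yourself. The three-state transition matrix is made up for illustration. Solving for the stationary distribution by least squares is just one of several reasonable approaches.

```python
import numpy as np

# Hypothetical 3-state transition matrix; each row sums to 1.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.5, 0.3, 0.2],
    [0.1, 0.3, 0.6],
])

# Distribution after 10 steps, starting with certainty in state 0.
start = np.array([1.0, 0.0, 0.0])
after_10 = start @ np.linalg.matrix_power(P, 10)

# Stationary distribution: solve pi @ P = pi together with sum(pi) = 1.
# Stack (P.T - I) on top of a row of ones and solve by least squares.
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.append(np.zeros(n), 1.0)
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print("after 10 steps:", after_10)
print("stationary:    ", pi)
```

In a matrix-first language the same steps would be written with a built-in matrix power and solve operator, which is the kind of difference the paragraph above is getting at.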
The first good language for linear algebra is Matlab. It is still incredibly popular, especially among the PhD Engineering crowd. For decades now, however, there has been a free open source alternative. Octave started out as a clone of Matlab and the language syntax is still very similar. However, development on Octave often moves a little faster than on Matlab, and Octave often gets features and functions before Matlab. This has caused Octave to become somewhat less compatible with Matlab as time goes on. Still, anyone who can code in Matlab should have no trouble picking up Octave, and Octave is free.
Image Editing and Drawing Software
While few of us are visual artists, either by training or inclination, visual display of data is an important part of data analysis. It is important to be able to illustrate our findings in professional-quality maps, charts, and diagrams. To do this, you need to have the ability to work with bitmap images and vector drawings. The Adobe products have been the commercial state of the art for some time now, but GIMP (for bitmaps) and Inkscape (for vector drawings) have most of the same functionality. As with the commercial programs, you tend to use them together as a team. One word of warning: if you are migrating from Adobe to the open source programs, you will find that the interfaces are very different, at least in the stock configuration. These programs are replacements, not clones.
One interesting feature of both GIMP and Inkscape is that both allow you to create scripts and plug-ins. One of the languages that is supported is (you guessed it) Python. While I haven’t experimented with it much, it seems like there is the potential here for some pretty serious data display. For instance, it seems like you could build a pretty killer mapping plug-in for Inkscape to display geographic data on different layers of a drawing.
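As a rough illustration of the idea, here is a minimal sketch that is not an Inkscape plug-in at all. It is plain Python that writes an SVG file in which each data set sits in its own group, marked with the `inkscape:groupmode="layer"` attribute that Inkscape uses to recognize layers. The sample coordinates, the layer names, and the simple equirectangular projection are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
INKSCAPE_NS = "http://www.inkscape.org/namespaces/inkscape"
ET.register_namespace("", SVG_NS)
ET.register_namespace("inkscape", INKSCAPE_NS)

WIDTH, HEIGHT = 720, 360

def project(lon, lat):
    """Equirectangular projection from degrees to SVG pixel coordinates."""
    x = (lon + 180.0) / 360.0 * WIDTH
    y = (90.0 - lat) / 180.0 * HEIGHT
    return x, y

def build_map(layers):
    """Build an SVG tree with one Inkscape layer per named point set."""
    svg = ET.Element(f"{{{SVG_NS}}}svg",
                     {"width": str(WIDTH), "height": str(HEIGHT)})
    for name, points in layers.items():
        group = ET.SubElement(svg, f"{{{SVG_NS}}}g", {
            f"{{{INKSCAPE_NS}}}groupmode": "layer",
            f"{{{INKSCAPE_NS}}}label": name,
        })
        for lon, lat in points:
            x, y = project(lon, lat)
            ET.SubElement(group, f"{{{SVG_NS}}}circle",
                          {"cx": f"{x:.1f}", "cy": f"{y:.1f}", "r": "3"})
    return svg

# Made-up sample data: (longitude, latitude) pairs grouped by layer name.
layers = {
    "stores": [(-87.6, 41.9), (-122.4, 37.8)],
    "warehouses": [(-95.4, 29.8)],
}
svg = build_map(layers)
svg_text = ET.tostring(svg, encoding="unicode")

with open("map_layers.svg", "w", encoding="utf-8") as handle:
    handle.write(svg_text)
```

Opening `map_layers.svg` in Inkscape should show each point set as a separate layer that can be toggled or restyled. A real mapping plug-in would need proper map projections and boundary data on top of this.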