I realized this morning that it has been a while since I posted any Python code. I’ve been a bit busy with Handyman Kevin and haven’t been doing much data science. Still, I decided it was time to carve out a couple of hours this morning to practice my skills. The result is these functions, which perform basic double exponential smoothing using the Holt-Winters method. I deliberately avoided using NumPy, SciPy, or any other libraries. It isn’t that I dislike NumPy/SciPy (far from it), but you can’t always get sysadmins to install extra libraries on the machines you’re using, especially if you are a guerrilla data scientist like me.
There are a lot of different time series methods out there, and they all have their points. Holt-Winters is the one that I keep coming back to, though. One of the reasons is simplicity–I can always remember it and bang it into a spreadsheet without needing to Google anything or download libraries. About the 40th time I typed it into a spreadsheet, though, it occurred to me that it would be smart to implement it in Python so I could save some typing.
The first function, MAPE, simply calculates the mean absolute percentage error (MAPE) of a list of estimated values, as compared to a list of actual values.
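To make the definition concrete, here is a minimal worked sketch of the same calculation on three made-up points. The lowercase name `mape` and the sample numbers are only for illustration, and the sketch assumes every actual value is positive so the division is safe.

```python
def mape(actual, estimate):
    """Mean absolute percentage error, returned as a fraction (0.1 means 10%)."""
    if len(actual) != len(estimate):
        raise ValueError("lists must be the same length")
    # Assumption: every actual value is positive, so we never divide by zero.
    errors = [abs(e - a) / a for a, e in zip(actual, estimate)]
    return sum(errors) / len(errors)


# Illustrative data: the first two estimates are each off by 10%, the third is exact.
actual = [100.0, 200.0, 400.0]
estimate = [110.0, 180.0, 400.0]
print(mape(actual, estimate))
```

The three individual errors are 0.1, 0.1, and 0, so the result is roughly 0.067, or about 6.7%.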
The next function, holtwinters, uses Holt-Winters to predict the next three values in a time series. You need to supply two smoothing coefficients, alpha and beta, for the level and trend, respectively. Typically, you would have a pretty good idea what these were from doing similar forecasts in the past.
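For readers who want to see the level and trend updates in isolation, here is a sketch of the textbook form of double exponential smoothing (Holt's linear trend method). It is not a copy of my holtwinters function, which initializes and back-casts differently; the name `holt_linear`, the `horizon` parameter, and the initialization from the first two points are assumptions made for this illustration.

```python
def holt_linear(ts, alpha=0.5, beta=0.5, horizon=3):
    """Textbook Holt linear-trend smoothing; returns `horizon` forecasts."""
    if len(ts) < 2:
        raise ValueError("need at least two points")
    # Assumption: start the level at the first value and the trend at the first difference.
    level = ts[0]
    trend = ts[1] - ts[0]
    for y in ts[1:]:
        prev_level = level
        # Level: blend the new observation with the previous one-step-ahead forecast.
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        # Trend: blend the latest change in level with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # Forecast h steps ahead by extending the last level along the last trend.
    return [level + h * trend for h in range(1, horizon + 1)]


print(holt_linear([10, 12, 14, 16, 18]))  # prints [20.0, 22.0, 24.0]
```

On a perfectly linear series like this one, the level tracks the data exactly and the trend stays at 2, so the forecasts simply continue the line.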
If you don’t know the coefficients then use the third function, holtwinters_auto, to automatically determine them. This function uses a grid search. Those of you who have read my monograph probably remember that I’m not usually wild about grid searches. In this case it makes sense, though, since you don’t usually need more than a few digits of precision on the coefficients.
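The idea of a grid search that zooms in one decimal place at a time can be sketched on its own, separate from the forecasting code. In the sketch below, `refine_grid_search`, `passes`, and `toy_loss` are hypothetical names, the toy objective stands in for the forecast error, and clamping the candidates to the range 0 to 1 is an assumption I added for the illustration.

```python
def refine_grid_search(loss, passes=4, start=(0.5, 0.5)):
    """Coarse-to-fine grid search over two coefficients in [0, 1]."""
    best_a, best_b = start
    best_loss = loss(best_a, best_b)
    for d in range(1, passes + 1):
        step = 10.0 ** -d
        # Freeze the grid center for this pass so the grid does not drift mid-search.
        center_a, center_b = best_a, best_b
        for i in range(-5, 6):
            for j in range(-5, 6):
                # Assumption: clamp candidates to [0, 1], the usual range for smoothing coefficients.
                a = min(1.0, max(0.0, center_a + i * step))
                b = min(1.0, max(0.0, center_b + j * step))
                candidate = loss(a, b)
                if candidate < best_loss:
                    best_a, best_b, best_loss = a, b, candidate
    return best_a, best_b, best_loss


# Toy objective with a known minimum at (0.3, 0.7), standing in for the forecast error.
def toy_loss(a, b):
    return (a - 0.3) ** 2 + (b - 0.7) ** 2


a, b, err = refine_grid_search(toy_loss)
print(round(a, 4), round(b, 4))
```

Each pass evaluates an 11 by 11 grid, so four passes cost under 500 evaluations of the objective, which is why a grid search is tolerable here.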
    def MAPE(actual, estimate):
        '''Given two lists, one of actual values and one of estimated values,
        computes the Mean Absolute Percentage Error'''
        if len(actual) != len(estimate):
            print("ERROR: Lists not the same length.")
            return []
        pcterrors = []
        for i in range(len(estimate)):
            # float() keeps Python 2 from truncating integer inputs
            pcterrors.append(abs(estimate[i]-actual[i])/float(actual[i]))
        return sum(pcterrors)/len(pcterrors)
    def holtwinters(ts, *args):
        '''Uses the Holt-Winters exp. smoothing method to forecast the next
        three points in a time series. The second two arguments are
        smoothing coefficients, alpha and beta. If no coefficients are given,
        both are assumed to be 0.5.
        '''
        if len(args) >= 1:
            alpha = args[0]
        else:
            alpha = .5
        if len(args) >= 2:
            beta = args[1]
        else:
            beta = .5
        if len(ts) < 3:
            print("ERROR: At least three points are required for TS forecast.")
            return 0
        est = []    # estimated value (level)
        trend = []  # estimated trend
        # For first value, assume trend and level are both 0.
        est.append(0)
        trend.append(0)
        # For second value, assume trend still 0 and level same as first
        # actual value
        est.append(ts[0])
        trend.append(0)
        # Now roll on for the rest of the values
        for i in range(len(ts)-2):
            trend.append(beta*(ts[i+1]-ts[i])+(1-beta)*trend[i+1])
            est.append(alpha*ts[i+1]+(1-alpha)*est[i+1]+trend[i+2])
        # Now back-cast for the first three values that we fudged
        est.reverse()
        trend.reverse()
        ts.reverse()
        for i in range(len(ts)-3, len(ts)):
            trend[i] = beta*(ts[i-1]-ts[i-2])+(1-beta)*(trend[i-1])
            est[i] = alpha*ts[i-1]+(1-alpha)*est[i-1]+trend[i]
        est.reverse()
        trend.reverse()
        ts.reverse()
        # And do one last forward pass to smooth everything out
        for i in range(2, len(ts)):
            trend[i] = beta*(ts[i-1]-ts[i-2])+(1-beta)*(trend[i-1])
            est[i] = alpha*ts[i-1]+(1-alpha)*est[i-1]+trend[i]
        # Holt-Winters method is only good for about 3 periods out
        next3 = [alpha*ts[-1]+(1-alpha)*(est[-1])+beta*(ts[-1]-ts[-2])+(1-beta)*trend[-1]]
        next3.append(next3[0]+trend[-1])
        next3.append(next3[1]+trend[-1])
        return next3, MAPE(ts, est)
    def holtwinters_auto(ts, *args):
        '''Calls the holtwinters function, but automatically determines the
        alpha and beta coefficients which minimize the error.

        The optional argument is the number of digits of precision you need
        for the coefficients. The default is 4, which is plenty for most real
        life forecasting applications.
        '''
        if len(args) > 0:
            digits = args[0]
        else:
            digits = 4
        # Perform an iterative grid search to find minimum MAPE
        alpha = .5
        beta = .5
        for d in range(1, digits):
            grid = []
            # Build both candidate lists before searching, so the grid stays
            # centered on the best point from the previous pass
            alphas = [x * .1**d + alpha for x in range(-5, 6)]
            betas = [x * .1**d + beta for x in range(-5, 6)]
            for b in betas:
                for a in alphas:
                    grid.append(holtwinters(ts, a, b)[-1])
                    if grid[-1] == min(grid):
                        alpha = a
                        beta = b
        next3, mape = holtwinters(ts, alpha, beta)
        return (next3, mape, alpha, beta)
As graduation nears, I look at my computer desktop and realize that most of the academic software licenses will expire before I start my next graduate program. For a data scientist, this is serious. How am I going to cross-tab survey results without SPSS? Am I going to have to do my stepwise regressions manually now? How am I going to create presentation-quality geographic displays without Tableau? What package am I going to use for linear algebra? I freak out pretty badly, until I remember that just about every application you could want for data analysis is available for free under a GPL license that never expires.
Some of the software I use is already open source. Data scientists’ two favorite programming languages, Python and R, are open source. The same goes for Linux, the operating system that runs on five of my seven computers. Many of the applications I use, though, are commercial, and either the company or my university gives me a free license. What follows is a quick survey of open source alternatives for the most commonly used software.
Spreadsheet applications are to the data analyst what a table saw is to a woodworker: the big tool in the middle of the shop that gets used somehow in nearly every project. For many of us, Excel is the first spreadsheet we learn, probably because it is standard equipment on most office and university computers.
Excel is the Chrysler New Yorker of spreadsheet applications: huge and comfortable but not too nimble, loaded with lots of features that are nice to have but that you don’t really need. Then again some features, like the way Excel handles data tables, advanced filtering, and pivot tables, can save a lot of time. Even the conditional formatting is nice to have. Plus, if you need to interface with business major types, Excel will be the only spreadsheet they’ve ever heard of.
Excel has plenty of drawbacks too, though. It is a huge program. It only runs on Windows (or OS X, if you don’t need to run any add-ins). The only natively supported scripting language is VBA. Perhaps worst of all, and unforgivably, it’s slow. If you haven’t noticed this for yourself, go try and do some sensitivity analysis on a simulation with 10,000 or more trials. Expect to have time for two or three cups of coffee every time you press F9.
Gnumeric is a totally different take on the concept of spreadsheets. When I first experimented with it about 15 years ago, I concluded that it was too limited to be useful. Since then, however, the project has reinvented itself as the lightweight, stripped-down spreadsheet for data analysis. If Excel is a Chrysler New Yorker, then Gnumeric is a Dodge Dart.
In recent years, an SPSS-style “statistics” menu has appeared in the Gnumeric interface. Now the most-used features of a spreadsheet and a statistics package are within easy reach, which will appeal to anyone who ever spent a morning clicking back and forth between Excel and SPSS while they analyzed a data set.
By far the most appealing feature of Gnumeric is that it uses Python as one of its scripting languages. This means that not only is it painless to create user functions but, given the multitude of libraries available for Python, you probably won’t need to very often. Excel’s non-linear solver seems pretty rudimentary when you have access to scipy.optimize. Also, since Gnumeric allows Python and C plug-ins, it is useful as a graphic front end to more complicated programs written in these languages. Pretty cool, especially considering the whole application is still lightweight enough to run on a $100 garage sale computer.
Other possibilities: OpenOffice (and its fork, LibreOffice) is also free and seems to be designed as a more direct replacement for Excel. If you need a more general purpose spreadsheet it might be a good choice.
When all you need to do is calculate a few confidence intervals or run a t-test, a spreadsheet application will probably be adequate. If you are creating complex statistical models, you are probably going to write them in R. Between these extremes, about 90% of statistics work gets done in a statistics package. Which one you prefer probably depends on which one your college statistics professor used. The thing they all have in common, however, is that a full license costs a small fortune. Usually, the coolest add-on packages (for simulation, predictive analytics, etc.) are even more of a buy-up. Luckily, for those of us of modest means, there is PSPP.
PSPP is intended as a direct clone of SPSS, but is GPL licensed, so it is completely free to use. One important difference from the other tools here is that PSPP is written mostly in C rather than Python, so it breaks the theme somewhat. Its syntax language is meant to be compatible with SPSS, so existing SPSS syntax files should mostly carry over, and it is portable enough to build on most common platforms.
Linear Algebra System
When you start building serious matrix-heavy models, such as anything involving Markov chains or finite element analysis, you are going to want to seriously think about using a language that is built for linear algebra. Sure, Python has good linear algebra support through NumPy (http://numpy.org) and other libraries. But Python is basically a general purpose, list-based language. Linear algebra looks ugly in Python, and ugly code takes longer to write and is harder to debug. It’s better to use the right tool for the job.
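To show what I mean by ugly, here is a minimal sketch of a small Markov chain calculation in plain, list-based Python with no libraries. The function names and the two-state transition matrix are made up for the illustration.

```python
def vec_times_matrix(vec, matrix):
    """Multiply a row vector by a matrix, both stored as plain Python lists."""
    n_cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
            for j in range(n_cols)]


def steady_state(transition, iterations=200):
    """Approximate the stationary distribution by repeated multiplication."""
    n = len(transition)
    dist = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(iterations):
        dist = vec_times_matrix(dist, transition)
    return dist


# Hypothetical two-state chain; each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]
print(steady_state(P))
```

For this chain the distribution settles at roughly [0.8333, 0.1667]. In a matrix language, the nested loops collapse into a single expression along the lines of a start vector times P^200, which is the whole argument for using one.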
The first good language for linear algebra was Matlab. It is still incredibly popular, especially among the PhD Engineering crowd. For decades now, however, there has been a free open source alternative. Octave started out as a clone of Matlab and the language syntax is still very similar. However, development on Octave often moves a little faster than Matlab, and Octave often gets features and functions before Matlab. This has caused Octave to become somewhat less compatible with Matlab as time goes on. Still, anyone who can code in Matlab should have no trouble picking up Octave, and Octave is free.
Image Editing and Drawing Software
While few of us are visual artists, either by training or inclination, visual display of data is an important part of data analysis. It is important to be able to illustrate our findings in professional-quality maps, charts, and diagrams. To do this, you need to have the ability to work with bitmap images and vector drawings. The Adobe products have been the commercial state of the art for some time now, but GIMP (for bitmaps) and Inkscape (for vector drawings) have most of the same functionality. As with the commercial programs, you tend to use them together as a team. One word of warning: if you are migrating from Adobe to the open source programs, you will find that the interfaces are very different, at least in the stock configuration. These programs are replacements, not clones.
One interesting feature of both GIMP and Inkscape is that both allow you to create scripts and plug-ins. One of the languages that is supported is (you guessed it) Python. While I haven’t experimented with it much, it seems like there is the potential here for some pretty serious data display. For instance, it seems like you could build a pretty killer mapping plug-in for Inkscape to display geographic data on different layers of a drawing.