Thought I would try out Kaggle. It’s a site for data analysis contests. First one I tried was the benchmark ‘digit recognizer’. This is a set of greyscale images of the digits 0, 1, …, 9, and your task is to knock out a model that can recognize them. I figured I’d start with something pretty ‘safe’.
Step one was principal components analysis. This reorganises the images, which are effectively 784-dimensional vectors. You get out a different 784-dimensional vector which contains the same information, but where the first number represents the image’s relationship to the most important ‘pattern’ in the data, the second number its relationship to the second most important ‘pattern’, and so on. Here’s a picture of the first such ‘pattern’ or basis vector turned into a 28×28 picture:
It has a lot of the 8, 9 and 3 about it. By looking only at the relationship of an image to the top few such patterns you can discard a lot of irrelevant information. This cuts down on computation time, and reduces the scope for coming up with models that are way too complex. I got the inspiration from Gilbert Strang’s lectures on linear algebra, particularly the one on the singular value decomposition (I think it was number 31).
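Here’s a minimal sketch of that step in R, assuming the Kaggle train.csv layout (a label column followed by 784 pixel columns); the file names and the choice of 50 components are illustrative rather than my exact settings:

```r
# A minimal sketch of the PCA step, assuming Kaggle's train.csv layout
# (a "label" column followed by 784 pixel columns).
train  <- read.csv("train.csv")
pixels <- as.matrix(train[, -1])   # drop the label, keep the 784 pixel values

# Centre the data and find the principal components (the 'patterns')
pca <- prcomp(pixels, center = TRUE, scale. = FALSE)

# The first basis vector reshaped to 28x28 and plotted as an image
image(matrix(pca$rotation[, 1], nrow = 28), col = grey.colors(256))

# Keep only each image's relationship to the top 50 patterns (50 is illustrative)
reduced <- pca$x[, 1:50]
```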
Second step was to use a pretty stock support vector machine to classify the points. As always, the bulk of the actual work was on data prep and thinking about what might be a good model. This solution got 98.3% accuracy and has me in at 13th out of 400 and something (80 better than the benchmark).
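A hedged sketch of that classification step, using the e1071 package’s SVM on the PCA-reduced features from the snippet above; the kernel, cost and the 50 retained components are assumptions, not my exact settings:

```r
# Sketch of the classification step with the e1071 package's SVM,
# trained on the PCA-reduced features from the snippet above.
library(e1071)

fit <- svm(x = reduced, y = factor(train$label), kernel = "radial", cost = 1)

# Test images get projected onto the same 50 patterns before prediction
test         <- as.matrix(read.csv("test.csv"))
test_centred <- scale(test, center = pca$center, scale = FALSE)
predictions  <- predict(fit, test_centred %*% pca$rotation[, 1:50])
```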
The next challenge I started was Gefcom wind power generation forecasting. This is a time series challenge where you have to predict power generation at wind farms up to 48 hours ahead. I wanted to improve my feel for practical time series work, and this project taught me a lot. Notably, not to treat a segment of a time series as if it were a set of independent observations. Of course I knew this was a bad idea. The professors tell you. The books tell you. I listened, I nodded, but wanted to see what would happen. How wrong would it actually go?
My solution was just over three times better than the benchmark, with residual errors of about 3.8% rather than 13%, but not as good as the leaders, who seem to be clustering around 2.3%. This was also a cash prize contest with $7500 at stake, so competition is both international and pretty fierce. I like that – done as a toy project it would be easy to say, “3.8%… that’s not bad.” Seeing quantitatively that it can be done better just makes me curious as to how. Also, if I place around the median (the contest hasn’t closed yet) in my first international professional sports race (and this is analogous), I wouldn’t be too disappointed.
If I started again I would not mess around with the data in the way I did. Much time can be wasted processing data into a format suitable for a poorly chosen model. On the other hand, I’m much, much better with MongoDB than I was prior to the challenge. My knowledge of various R packages is also much improved, in particular those for tuning neural networks.
An ARIMA model would probably be a good alternative (a quick sketch of what that might look like is below). Alas, I’ve now run out of time to try it. University has taken over my evenings again, so I’m back to the books and formal statistics and away from the wilds of Kaggle for a bit. My ears will now be wide open and brain receptive for the time series stuff later this term.
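For the record, a minimal sketch of what that ARIMA attempt might have looked like in R, using the forecast package; the series name farm1_output and the hourly frequency are assumptions:

```r
# A sketch of the ARIMA alternative, using the forecast package.
library(forecast)

power <- ts(farm1_output, frequency = 24)  # hypothetical hourly wind farm output
fit   <- auto.arima(power)                 # choose (p, d, q) automatically
fc    <- forecast(fit, h = 48)             # predict 48 hours ahead, as the contest asks
plot(fc)
```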