First two Kaggle scrummages

Thought I would try out Kaggle. It’s a site for data analysis contests. First one I tried was the benchmark ‘digit recognizer’. This is a set of greyscale images of the digits 0,1,…,9. Your task is to knock out a model which can recognize them.  Thought I would try something pretty ‘safe’.

Step one was principal components analysis. This reorganises the images, which are effectively 784-dimensional vectors. You get out a different 784-dimensional vector, which contains the same information, but where the first number represents the image’s relationship to the most important ‘pattern’ in the data, the second number its relationship to the second most important ‘pattern’, and so on. Here’s a picture of the first such ‘pattern’ or basis vector turned into a 28×28 picture:

Which has a lot of the 8,9 and 3 about it. By looking only at the relationship of an image to the top few such patterns you can discard a lot of irrelevant information. This cuts down on computation time, and reduces the scope for coming up with models that are way too complex. I got the inspiration from Gilbert Strang’s lectures on linear algebra, particularly (think it was number 31) on the singular value decomposition.
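A minimal sketch of the projection step, assuming NumPy and random stand-in data rather than the actual Kaggle images. The function name pca_project and the choice of 30 components are illustrative assumptions, not what was used for the submission:

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal components via the SVD."""
    mean = X.mean(axis=0)
    Xc = X - mean                       # centre each pixel column
    # Rows of Vt are the orthonormal 'patterns', ordered by importance
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    scores = Xc @ components.T          # each image's relationship to each pattern
    return scores, components, mean

# Stand-in data (assumption): 200 random "images" of 28x28 = 784 pixels
rng = np.random.default_rng(0)
images = rng.random((200, 784))

scores, components, mean = pca_project(images, k=30)
first_pattern = components[0].reshape(28, 28)   # viewable as a 28x28 picture
```

Each image is now described by 30 numbers instead of 784, and those 30 numbers are what a classifier would be trained on.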

Second step was to use a pretty stock support vector machine solution to classify the points. As always the bulk of the actual work was on data prep and thinking about what might be a good model. This solution got 98.3% accuracy and has me in at 13 out of 400 and something (80 better than the benchmark).

The next challenge I started was Gefcom windpower generation forecasting. This is a time series challenge where you have to predict power generation at wind farms up to 48 hours ahead. I wanted to improve my feel for practical time series work and this project taught me a lot. Notably not to treat a segment of a time series as if it were a set of independent observations. Of course I knew this was a bad idea. The professors tell you. The books tell you. I listened, I nodded, but wanted to see what would happen. How wrong would it actually go?
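To put a number on the independence problem, here is a small sketch using a made-up autoregressive series (an assumption for illustration, not the wind farm data). It compares the lag-1 autocorrelation of the series with that of a shuffled copy, which is roughly what independent observations would look like:

```python
import random

def lag1_autocorrelation(xs):
    """Sample correlation between each value and the one that follows it."""
    n = len(xs)
    mean = sum(xs) / float(n)
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

random.seed(0)

# Hypothetical AR(1) series: each value is 0.9 times the last, plus noise
series = [0.0]
for _ in range(1999):
    series.append(0.9 * series[-1] + random.gauss(0, 1))

# Shuffling destroys the time ordering but keeps the same values
shuffled = series[:]
random.shuffle(shuffled)

print(lag1_autocorrelation(series))
print(lag1_autocorrelation(shuffled))
```

The ordered series carries strong dependence between neighbours while the shuffled copy carries almost none, so a model that treats the two the same is throwing away most of the structure.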

My solution was just over three times better than the benchmark, with residual errors of about 3.8% rather than 13%. But not so good as the leaders, which seem to be clustering on 2.3%. This was also a cash prize contest with $7500 at stake, so competition is both international and pretty fierce.  I like that – done as a toy project it would be easy to say, “3.8%… that’s not bad.” Seeing quantitatively that it can be done better just makes me curious as to how. Also, if I placed around the median (not closed yet though) in my first international professional sports race (and this is analogous) I wouldn’t be too disappointed.

If I started again I would not mess around with the data in the way I did. Much time can be wasted processing data into a format suitable for a poorly chosen model. On the other hand, I’m much, much better with MongoDB than I was prior to the challenge. My knowledge of various R packages is also much improved, in particular tuning neural networks.

An ARIMA model would probably be a good alternative. Alas I’ve now run out of time to try it. University has taken over my evenings again. So I’m back to the books and formal statistics and away from the wilds of Kaggle for a bit. My ears will now be wide open and brain receptive for the time series stuff later this term.

Intelligence in War [Random Book of the Week]

Random book of the week was John Keegan’s Intelligence in War. Not going to review the narrative, just a few of the less mundane points I took:

  • ‘War is ultimately about doing, not thinking.’ The best example in the text is the British/New Zealand force in Crete. Extensive intelligence was had on Nazi invasion plans through the Ultra system of Enigma decryption. The Allied force was sizeable but still lost, Keegan argues, due to poor tactical decisions on the Allied side. The CIA review suggests this isn’t a new position, but I’m not a frequent reader of military intelligence.
  • Numerous examples of the price of arrogance.

“Had he explained why he wanted to know where DK was, he would have been told that the German flagship left its call sign at home when proceeding to sea, to disguise its movements and adopted another. On the basis of his half-clever question, Jackson therefore telegraphed Jellicoe, commander of the Grand Fleet at Scapa Flow, to assure him that the High Seas Fleet was still in harbour.” [meaning the British were late to the fight at Jutland and...] “Admiral Jackson’s reluctance to take the codebreakers into his confidence robbed the Grand Fleet of a major opportunity to scupper the German navy for good.”

  • Democracies beat dictatorships in technology-dependent fights. Example: V2 rockets were on the order of eighty times more expensive than V1s but Hitler liked them, so resources got diverted. Turned out to be a good technology (man on the moon), but just not on the time scale and for the objectives required. Similarly in the Battle of the Atlantic, Keegan argues the Allies innovated and produced their way out of a hole, deploying new weapons such as homing torpedoes, improved sensors and prefabricated shipbuilding, with the innovation owing in part to wider circles of communication.
  • People have trouble expecting that others are capable of feats beyond their own abilities. Germans not believing the British had centimetric radar. British not believing Germany had developed liquid fuelled rockets. The Union’s amazement at ‘Stonewall’ Jackson’s ‘foot cavalry’ marching speed. Etc.

Random business biography of the week

Duncan Bannatyne’s ‘Anyone Can Do It’. Well worth a read, can be done in a day. My reading of his main points:

  • No age limit. While entrepreneurial as a child, Bannatyne morphed from beach bum to entrepreneur at around 30.
  • Outsider mentality. Outsider at school. Dishonourably discharged from the Navy. Liminal life in Jersey.
  • Networking isn’t all it’s cracked up to be. Selling good value products trumps networking.
  • In fact: focus on value, not price.
  • The government publishes leaflets on everything. DIY before hiring specialist advisers.
  • No speed limits. If something is working, why not do more of it?
  • Sacrifices needed to get things done, but delegation greatly increases his capacity.
  • Focus on market microstructure and regulation. When building care homes, look at how many beds per staff member. Build in multiples of that number.
  • Prefers to keep companies private. Stock exchange flotation is expensive in terms of fees, and was a poor long term strategy for that particular company.
  • Save paperclips. Works at home. Swanky profits over swanky offices.
  • Help to those who really need it or who help themselves. Charity for the exceptionally poor through no fault of their own: Romania.

Miserere, mei Deus

I came across Allegri’s beautiful piece of late Renaissance choral music, Miserere mei, Deus. Aside from being transportive in its own right, the historical detail regarding its development is fascinating:

“at some point, it became forbidden to transcribe the music and it was allowed to be performed only at those particular services, adding to the mystery surrounding it.

[..]

According to the popular story (backed up by family letters), the fourteen-year-old Mozart was visiting Rome, when he first heard the piece during the Wednesday service. Later that day, he wrote it down entirely from memory, returning to the Chapel that Friday to make minor corrections. Some time during his travels, he met the British historian Dr Charles Burney, who obtained the piece from him and took it to London, where it was published in 1771. Once the piece was published, the ban was lifted; Mozart was summoned to Rome by the Pope, only instead of excommunicating the boy, the Pope showered praises on him for his feat of musical genius.”

There is a lesson in there on the weakness of security through obscurity. That is, assuming others are incapable of doing things you’d find difficult yourself (or being plain arrogant). There’s also the love of skill; something I think most people share. The Pope appreciating dexterity so much as to forgive Mozart’s transgression.

I heard the version sung by The Sixteen.

 

A sip of image processing

I wanted to prepare some ground to do a little image processing. Had an idea to do a Sudoku solver for Android (answer: yes it’s already been done).

The ever-faithful scipy provides a lot of the gear needed to get going through python. It’s surprisingly easy to access images as data. Just scipy.misc.imread() to summon an N by M by 3 array of numbers.

Anyway I played around with a cellular automaton to see what would happen. I read a big chunk of Stephen Wolfram’s book a while back and thought cellular automata would provide a nice starting point to generate some pictures from. I find them both curious, as simple ways to generate complex behaviour, and attractive.
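For a flavour of how little code this takes, here is a sketch of an elementary (one-dimensional, two-state) cellular automaton of the kind Wolfram catalogues. The choice of Rule 30, the grid width and the wrap-around edges are all assumptions for illustration, and the output is printed as text rather than written to an image:

```python
def step(cells, rule):
    """Apply one update of an elementary cellular automaton (edges wrap around)."""
    n = len(cells)
    out = []
    for i in range(n):
        left, centre, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        index = (left << 2) | (centre << 1) | right   # neighbourhood as a 3-bit number
        out.append((rule >> index) & 1)               # look up that bit of the rule number
    return out

def run(rule, width, steps):
    """Start from a single live cell in the middle and record every generation."""
    cells = [0] * width
    cells[width // 2] = 1
    rows = [cells]
    for _ in range(steps):
        cells = step(cells, rule)
        rows.append(cells)
    return rows

rows = run(rule=30, width=31, steps=15)
for row in rows:
    print("".join("#" if c else "." for c in row))
```

The list of rows is already a grid of 0s and 1s, so it can be scaled up and treated as pixel data in the same way as a loaded image.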


Hello World for PyML

After a while concentrating on more abstract stuff I thought I would return to Support Vector Machines. These are primarily classifiers which assign data to one of two categories. E.g. in the picture below, red or blue. Having read up on elementary vector geometry, and more optimisation stuff (through economics) I found the subject much more penetrable.

PyML is pretty easy to get hold of and install. Don’t expect much in the way of documentation though. These are my notes on how to wring something visible out of it as a ‘Hello World’ use of SVM. Now after writing some code to generate a data set (more on that below), the following few lines get us to a visible output:

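from pylab import array
from PyML import *
from PyML.demo import demo2d

# means of the two 4-dimensional Gaussian classes
mu=[array([1,-2,3,-3]),array([-2,3,-1,3])]
# transformed_gauss and f2 come from the data generation code
data = transformed_gauss( 150, f2, mu )
data.attachKernel( 'Gaussian' )
s = SVM()
s.train( data )
# plot the points and the trained decision surface
demo2d.setData( data )
demo2d.decisionSurface( s )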

 

You can then use the cross-validation method to get an estimate of the classifier’s performance:

In [36]: s.cv(data)
[...]
Confusion Matrix:
      Given labels:
       0    1
    0  52   23
    1  22   53

Here’s the code used to generate the data. I wanted something a bit messier than the inbuilt data, and something amenable to 2d visualisation. The code generates a set of 4-dimensional Gaussians, and then maps them on to two dimensions. My thought was to take a data set that is linearly separable in its original dimensionality, distort it down, then see how easily SVM can restore the separation.

from pylab import *
from PyML import *
from PyML.demo import demo2d

def gauss_data( N, mu ):
    p = mu[0].shape[0]
    N = N + N%2 #make N divisible by two
    X = []
    Y = []
    for i in range(N):
        #select a random class
        class_index = randint(0,2)
        #create a new X point -> a p-dimensional Gaussian with mean of that class
        X.append( randn(p) + mu[class_index] )
        #create the new Y label -> the class index as a string ('0' or '1')
        Y.append( str(class_index) )
    return X, Y

f2 = lambda x: array([ sin(x[0] + x[1] ), cos( x[2] + x[3] ) ])

def transform( X, f ):
    return [f(x) for x in X]

def transformed_gauss( N, f, mu):
    X, Y = gauss_data( N, mu )
    Z = transform( X, f )
    D = VectorDataSet( Z, L=Y )
    return D

My other tip on PyML is that it likes to construct DataSet instances with the label list as strings.

Some pictures of simple differential equation systems

This week I have been mostly wondering what simple systems of differential equations look like. The pictures in the books often either have too many arrows or too few.  There’s something aesthetically unappealing about it. Being a humanities person, I want to see the stories of individual points. Just thought I’d share them.

Here’s a stable set of equations:

\[ \begin{aligned} \dot{x} &= -2x + y \\ \dot{y} &= x - 2y \end{aligned} \]

The lines on the graph represent $$\dot{x}=0, \dot{y}=0$$. Hence, where they meet is an equilibrium point, which may be stable or unstable.

Then there’s its more awkward cousin:

\[ \begin{aligned} \dot{x} &= 10 + 2x - 3y \\ \dot{y} &= 9 + x - 2y \end{aligned} \]

Zooming out a bit (not using arrow method), but before it turns into a straight line zipping off to infinity:

I particularly like this damnably simple, stable spiral:

\[ \begin{aligned} \dot{x} &= -x + y \\ \dot{y} &= -x - y \end{aligned} \]

And here’s its badly behaved cousin, flinging points out all over the show:

\[ \begin{aligned} \dot{x} &= -2x + y \\ \dot{y} &= x + 2y \end{aligned} \]
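The behaviour of the four systems above can be checked without plotting anything, using the standard trace and determinant test on the coefficient matrix. This is a sketch of that textbook test, not the plotting code; the function name classify and the labels in the dictionary are illustrative:

```python
def classify(a, b, c, d):
    """Classify the equilibrium of x' = a*x + b*y + e, y' = c*x + d*y + f.

    Only the linear coefficients matter for the type: the constants e and f
    move the equilibrium point without changing its character.
    """
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det      # sign decides real vs complex eigenvalues
    if det < 0:
        return "saddle (unstable)"      # eigenvalues of opposite sign
    if det == 0:
        return "degenerate"
    kind = "spiral" if disc < 0 else "node"
    if trace < 0:
        return "stable " + kind
    if trace > 0:
        return "unstable " + kind
    return "centre"

# Coefficients (a, b, c, d) read off the four systems in this post
systems = {
    "first system": (-2, 1, 1, -2),
    "awkward cousin": (2, -3, 1, -2),
    "simple spiral": (-1, 1, -1, -1),
    "badly behaved cousin": (-2, 1, 1, 2),
}

for name, coeffs in systems.items():
    print(name, "->", classify(*coeffs))
```

Under this test the first system is a stable node, the third a stable spiral, and both cousins are saddles, which is why their points eventually head off to infinity.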

Code on github, should you wish to play.

Acme Rink Company Ltd

Continuing on my quest to find start-up fads of times past. I wanted to move past the tactic of just ranking based on keyword count. Here’s the metric I’m testing:

\[ M = \frac{(N-1)^b}{(H+0.001)^a}, \quad 0 < a, b < 1 \]

Where H is the information entropy of the probability distribution of the keyword conditioned on the year of incorporation. So H is smaller if the event is more predictable, say you know there was a massive boom in chip shops during 1922. Hence H goes in the denominator, as we want low entropy keywords to rank high (for now). N is the keyword count, so the rank metric is increasing with N.
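A sketch of how the metric could be computed for a single keyword. It assumes H is taken over the share of the keyword's occurrences falling in each year and uses natural logarithms; both are readings of the definition rather than confirmed details, and the two example keywords are made up:

```python
import math

def entropy(year_counts):
    """Shannon entropy (natural log) of a keyword's distribution over years."""
    total = float(sum(year_counts.values()))
    h = 0.0
    for count in year_counts.values():
        if count > 0:
            p = count / total
            h -= p * math.log(p)
    return h

def rank_metric(year_counts, a=1.0, b=0.5):
    """M = (N - 1)^b / (H + 0.001)^a, with N the total keyword count."""
    n = sum(year_counts.values())
    return (n - 1) ** b / (entropy(year_counts) + 0.001) ** a

# Hypothetical keywords: ten incorporations in one year vs one a year for ten years
concentrated = {1908: 10}
spread = dict((1900 + i, 1) for i in range(10))

print(rank_metric(concentrated))
print(rank_metric(spread))
```

Both keywords have the same count, so the whole difference in rank comes from the entropy term: the keyword bunched into one year scores far higher than the one spread evenly.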

Using a=1, b=0.5, these are the twenty highest ranked keywords:

0) Rink
1) Skating
2) Coy
3) Greyhound
4) (1920)
5) Tavern
6) Wireless
7) Cinema
8) Sailing
9) Radio
10) Exploring
11) Oilfields
12) Columbia
13) Aircraft
14) Picture
15) Golden
16) Son,
17) Mines,
18) Theatres
19) Reefs

Investigating ‘Rink’, it’s clear when the skating expansion takes place:

For those interested, here are some of the names. Prize goes to Acme Rink Company Limited, 1892.

The Sports Arenas and Ice Rinks Construction Corporation Ltd, 1928
Rink Equipment Company Ltd, 1926
Acme Rink Company Ltd, 1892
Crystal Ice Rink, Ltd, 1891
Brighton Rink Syndicate Ltd, 1896
Savoy Rink Ltd, 1928
Palais Roller Rink (Hull) Ltd, 1929
Keighley Skating Rink Ltd, 1929
Hinckley Roller Skating Rink Ltd, 1930
Rink (Bishop Auckland) Ltd, 1930
Billy-Jeans Ice Rinks Ltd, 1972
Edinburgh Skating Rink Company Ltd, 1908
Sunderland Skating Rink Company Ltd, 1908
Belfast Skating Rink Company Ltd, 1908
Leeds Skating Rink Company Ltd, 1908
Glasgow Skating Rink Company Ltd, 1908
Dublin Skating Rink Company Ltd, 1908
Birmingham Skating Rink Company Ltd, 1908
London Olympia Skating Rink Company Ltd, 1908
St James's Hall, Manchester, Skating Rink Company Ltd, 1908

Black Swan Gold Mine Ltd

At Hack on the Record I took a big chunk of Board of Trade data on historic company incorporations (hat-tip @Baloun).

The end goal is to identify past start-up fads and bubbles by looking at keywords. Probably through rankings based around minimum information entropy. Thought it could provide an interesting way to do a spot of economic history.

So, I was pretty pleased when, taking the first 5000 incorporations off the top, the term ‘gold’ appeared high in the ranks, with a spike in 1896.

It’s the mixed effects of the Klondike gold rush, Cecil Rhodes, and hunting the golden bunyip in Western Australia. Sitting there in the data like a pure nugget in a clear mountain stream. No fancy smelting processes needed.

Here are the unfiltered keywords for 1896:

Company, 375
and, 240
Syndicate, 139
Gold, 97
Mines, 51
Mining, 34
Corporation, 25
London, 23
Club, 23
Development, 22
New, 18
Exploration, 18
Investment, 17
Steamship, 16
Cycle, 15
British, 15
of, 14
Publishing, 14
General, 13
Association, 13
Steam, 12
United, 11
South, 11
Brick, 11
Zealand, 10
Manufacturing, 10
African, 10
Works, 9
W, 9
Trust, 9
J, 9
Explorers, 9
City, 9
West, 8
Patent, 8
Laundry, 8
Gas, 8
Finance, 8
F, 8
Brothers, 8
Universal, 7
Tile, 7
Supply, 7
Sons, 7
Mine, 7
James, 7
H, 7
Creek, 7
Colonial, 7
Colliery, 7
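Counts like the ones above can be produced with very little code. This sketch assumes the records are available as (company name, year) pairs and that keywords are simply whitespace-separated tokens; the real data format and any filtering are not shown here, and the years attached to the gold companies are assumed to be 1896:

```python
from collections import Counter, defaultdict

def keyword_counts_by_year(records):
    """Count whitespace-separated keywords in company names, grouped by year."""
    counts = defaultdict(Counter)
    for name, year in records:
        counts[year].update(name.split())
    return counts

# A handful of names from this page, as (name, year of incorporation) pairs
records = [
    ("Cobar Gold Mines Ltd", 1896),
    ("Gold Securities Ltd", 1896),
    ("Black Swan Gold Mine Ltd", 1896),
    ("Bunyip Gold Mines Ltd", 1896),
    ("Acme Rink Company Ltd", 1892),
]

counts = keyword_counts_by_year(records)
for word, n in counts[1896].most_common(3):
    print("%s, %d" % (word, n))
```

The per-year counters are also the natural input for an entropy-based ranking, since each keyword's counts across years give its distribution directly.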

Here are the names of the companies involved, some good fun in here:

Cobar Gold Mines Ltd
Kootenay Gold Fields Syndicate Ltd
New Zealand Gold Development Syndicate Ltd
Rooderand Main Reef Gold Mining Company Ltd
Seine River (Ontario) Gold Mines Ltd
Chili Gold Gravels Ltd
Towranna Gold Mines of Western Australia Ltd
Hannans "Empress" Gold Mining and Development Company Ltd
Candelaria Gold Mines Ltd
Lucky Guss Gold Mine Ltd
Hauraki (N Z) Associated Gold Reefs Ltd
Hannan's Premier Gold Mines Ltd
Summit Flat Gold Mines Ltd
Good Luck Gold Properties Ltd
Associated Southern Gold Mines (W A) Ltd
Gold Securities Ltd
Rockhampton (Queensland) Gold Estate Ltd
Truer River Gold Mining Company Ltd
Antenior (Matabelle) Gold Mines Ltd
90-Mile Proprietary Gold Mines Ltd
Captain Robinson's Gold Reefs Ltd
Waitekauri Cross Gold Mining Company Ltd
New Zealand Gold Investment Company Ltd
Gullewa Gold Mines Ltd
Merced Monster Gold Mines Ltd
Waihi Consolidated Gold Mines Ltd
Irassu Gold Exploration Syndicate Ltd
Menzies Golden Rhine Gold Mines (WA) Ltd
Hannans Gold Hill Ltd
Lady Margaret Gold Mining Company Ltd
Princess Alix Gold Mines Ltd
Renmark Gold Mines Ltd
Morris Ravine Gold Mines Ltd
Lone Ridge Gold Mine Ltd
Easter Gift Proprietary Gold Mines Ltd
White Flag Consols Gold Mines Ltd
Selukwe Gold Mining Company Ltd
Gold Reefs of Western Australia Ltd
Gold Mines Corporation Ltd
Hannans Mount Ferrum Gold Mines Ltd
Westralia and New Zealand Gold Explorers Ltd
Nil Desperandum Gold Mines Ltd
Regina (Canada) Gold Mine Ltd
Lake View and Boulder Junction Gold Mines Ltd
Kinsella Gold Mines Ltd
Armadale Gold Mining Company Ltd
Lynx Creek Gold Mining Company Ltd
Kurnalpi Gold Exploration and Development Company (W A) Ltd
Oliphants' Olei Gold Mining Company Ltd
Universal Gold Syndicate Ltd
Norseman Gold Mines Ltd
Pinnacles Gold Mine Ltd
Hannan's Queen Gold Mines Ltd
City of London Gold Mines Ltd
Lady Maude Gold Mines Ltd
Bingham's Randfontein Gold Mining Company Ltd
Shamrock Gold Mining Company Ltd
Herbert Gold Ltd
Utah Consolidated Gold Mines Ltd
General Gordon Gold Mines Ltd
Rose-Hill United Gold Mines Ltd
Joker (Yalgoo) Gold Mines Ltd
Lady Emily Gold Mining Company Ltd
Bellibetta Gold Company Ltd
Victoria Reef Gold Mine Ltd
African Daspoort Gold Mines Ltd
Lochinvar Gold Mines Ltd
Golconda Gold Mines Ltd
Seven Sisters Gold Mines Syndicate Ltd
Moel Offrwm Gold Mining Company Ltd
All Nations Gold Mines Ltd
Cripple Creek Gold and Exploration Ltd
Lady Evelyn Gold Mines Ltd
Hannans United Gold Estates Ltd
Western Star Gold Mining Company Ltd
Corsair Consolidated Gold Mines Ltd
Elandsfontein No 2 Gold Mining Company Ltd
Sunbeam and Vigilant Gold Mines Ltd
Black Swan Gold Mine Ltd
"Hesperus" Gold Mining Company Ltd
Gold Mining Association Ltd
Mount Hepburn Gold Mine Ltd
Unionist Gold Mining Syndicate Ltd
Rhodesian Gold Properties Ltd
Hauraki Gold Properties Ltd
Santa Anna Gold Mining Company Ltd
Menzies Gold Development Company Ltd
British Columbia Gold Syndicate Ltd
Woodleys Reward Gold Mines Ltd
Huttons (Bechuanaland) Gold Reefs Development Company Ltd
Anglo-Rhodesian Gold Mining and Engineering Company Ltd
Mount McDonald Gold Mines Ltd
Great Victoria Gold Mining Company Ltd
Bunyip Gold Mines Ltd
Hikutaia Gold Syndicate Ltd
Dorothy Gold Mining Company Ltd
New Alburnia Gold Mining Company Ltd

Can’t wait to see what stories are lurking in the other 175k records!