Need a way to picture Big Data? Start working with it!

Fact: It’s super hard to imagine very large numbers. Take a moment and try to picture a mountain that is 50,000 km high, or a pile of $20 billion! Close your eyes and try to visualize that giant rock or massive mountain of $$$. With massive amounts of bits and bytes, it’s no different. What it means to have millions of observations to process is just mind-blowingly tough to imagine a priori. But it gets so much easier once you start working with it.

After completing the Coursera ML course, I wanted to get my hands dirty on an artificial “real” task. And what better way to start than participating in a Kaggle competition? I’ve chosen to work on the digit recognizer problem in R. It’s not a particularly large data set by internet standards, comprising only 42,000 observations. But the original input matrix has a dimension of 42,000 x 784.

As I am trying to implement a solution using neural networks, the (batch) algorithm ended up performing a matrix multiplication of (42000 x 784) %*% (784 x # of hidden units) in every iteration. With only 100 hidden units, that one product takes 42,000 x 784 x 100 = 3,292,800,000 scalar multiplications. And good solutions for this problem require hundreds of hidden units and hundreds of iterations, which pushes the total into the trillions.

Finding a name for a number like that (a few trillion?), much less visualizing it, is pretty hard. But after trying to run a batch network in all my naivety, it wasn’t hard at all: PRETTY DAMN LOOOONG! It was basically running and running and running for days and nights until I couldn’t bear it any longer. (OK, maybe the MacBook Air is not the perfect device for everything…)
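To make the size of that matrix product a bit more tangible, here is a minimal sketch in Python (not the R code used for the competition). It counts one scalar multiplication for every (row, shared index, column) triple, so an (m x n) times (n x p) product costs m * n * p multiplications. The function name and the 300 hidden units / 500 iterations in the second half are assumptions made up for illustration, not values from my actual runs.

```python
def matmul_multiplications(m: int, n: int, p: int) -> int:
    """Scalar multiplications needed for an (m x n) times (n x p) product."""
    # Each of the m * p output cells is a dot product of length n.
    # Note: this counts multiply operations, not the product of the
    # two matrices' element counts.
    return m * n * p


observations = 42_000   # rows of the input matrix (from the post)
pixels = 784            # columns of the input matrix (from the post)
hidden_units = 100      # hidden layer size used in the post's example

per_iteration = matmul_multiplications(observations, pixels, hidden_units)
print(f"{per_iteration:,} multiplications per iteration")

# Hypothetical training setup, chosen only to show how the cost scales.
assumed_hidden_units = 300
assumed_iterations = 500
total = (
    matmul_multiplications(observations, pixels, assumed_hidden_units)
    * assumed_iterations
)
print(f"{total:,} multiplications over the whole run")
```

This only covers the first layer's forward pass. Backpropagation and the output layer add more work on top, so treat the result as a lower bound.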

I quickly moved away from batch gradient descent and also drastically reduced the dimensions of the input matrix using KNN and column variance. Now I’m in the last steps of tuning mini-batch gradient descent and parallelizing it. More about it soon.
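For anyone who hasn't met these two ideas, here is a rough sketch in plain Python (again, not my R code, and only one possible reading of "column variance"). It assumes the variance step means dropping columns that never change, such as pixels that are blank in every image, and that mini-batching means shuffling the rows and working through them in small chunks. All function names, the toy data and the batch size are hypothetical, and the KNN step is left out.

```python
import random
from statistics import pvariance


def keep_varying_columns(rows, min_variance=0.0):
    """Return indices of columns whose variance exceeds min_variance."""
    columns = list(zip(*rows))  # transpose: one tuple per column
    return [i for i, col in enumerate(columns) if pvariance(col) > min_variance]


def select_columns(rows, keep):
    """Keep only the chosen column indices in every row."""
    return [[row[i] for i in keep] for row in rows]


def mini_batches(rows, batch_size, seed=0):
    """Yield shuffled batches of at most batch_size rows (one epoch)."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [rows[i] for i in order[start:start + batch_size]]


# Toy data standing in for pixel rows: columns 0 and 3 never change.
data = [
    [0, 5, 1, 0],
    [0, 9, 3, 0],
    [0, 2, 7, 0],
    [0, 4, 2, 0],
    [0, 8, 6, 0],
]
keep = keep_varying_columns(data)
reduced = select_columns(data, keep)
batches = list(mini_batches(reduced, batch_size=2))
print(keep, [len(b) for b in batches])
```

Each gradient step then only touches one small batch instead of all 42,000 rows, and every dropped column shrinks the matrix product a little further.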

{If you’ve come this far, why don’t you share your ideas and experience in the comments and follow me on Twitter to discuss this topic and receive more information on data, startups, & learning.}

Image attributions: blmiers2, CC; Cheeckyneedle, CC