Need a way to picture Big Data? Start working with it!

2013-07-17

by Alexander Puschilov

Big data has many dimensions: variables, disk space, processing time, etc. The best way to gain an intuition for them is to actually work with the data.

Fact: It’s super hard to imagine very large numbers.

Take a moment and try to picture a mountain that is 50,000 km high, or a pile of $20 billion in $50 bills! Close your eyes and try to visualize that giant rock or massive mountain of cash.

And with massive amounts of bits and bytes, it’s no different. What does it really mean to have millions of observations to process? I find it truly mind-blowing to try to imagine that a priori.

But it gets so much easier once you start working with it. After completing the ML Coursera course, I wanted to get my hands dirty on an artificial “real” task. And what better way to start than participating in a Kaggle competition?

I’ve chosen to work on the digit recognizer problem in R. It’s not a particularly large data set by internet standards, comprising only 42,000 observations. But the original input matrix has dimensions of 42,000 x 784 (one column for each pixel of a 28 x 28 image).

Since I am implementing a solution using neural networks, the (batch) algorithm ends up performing a matrix multiplication of (42,000 x 784) %*% (784 x number of hidden units) in every iteration.

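With only 100 hidden units, that single product takes 42,000 x 784 x 100 = 3,292,800,000 scalar multiplications. And good solutions for this problem require hundreds of hidden units and hundreds of iterations, which pushes the total number of multiplications well past a trillion.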
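As a rough back-of-the-envelope check, the cost of one such matrix product can be counted in a few lines. This is a minimal sketch in Python rather than R, assuming a plain schoolbook multiplication; the function name and the iteration count are made up for illustration. It also shows that multiplying the entry counts of the two matrices together gives a much larger figure, which overstates the work by a factor of 784.

```python
# Sizes taken from the post.
n_obs = 42_000      # observations (rows of the input matrix)
n_features = 784    # pixel columns of the input matrix
n_hidden = 100      # hidden units


def matmul_multiplications(rows: int, inner: int, cols: int) -> int:
    """Scalar multiplications in a schoolbook (rows x inner) by (inner x cols) product."""
    return rows * inner * cols


per_product = matmul_multiplications(n_obs, n_features, n_hidden)

# Multiplying the entry counts of the two operands is NOT the operation count:
# it is larger than the real count by a factor of n_features.
entry_count_product = (n_obs * n_features) * (n_features * n_hidden)

# Hypothetical training length, chosen only for illustration.
assumed_iterations = 500
total = per_product * assumed_iterations

print(f"{per_product:,} multiplications per product")
print(f"{entry_count_product:,} = product of the two entry counts")
print(f"{total:,} multiplications over {assumed_iterations} iterations")
```

The count covers only this one product in the forward pass; the rest of the network and the gradient computation add more on top.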

Finding a name for a number this huge (two and a half trillion?), much less visualizing it, is pretty hard. But after trying to run a batch network in all my naiveté, grasping it wasn’t hard at all: it just took a lot of time!

It was basically running and running and running for days and nights until I couldn’t bear it any longer. (OK, maybe the MacBook Air is not the perfect device for everything…)

I quickly moved away from batch gradient descent and also drastically reduced the dimensions of the input matrix using KNN and column variance. Now I’m in the last steps of tuning mini-batch gradient descent and parallelizing it.
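To make two of these ideas concrete, here is a minimal sketch in Python (not the R code used for the competition) of filtering columns by variance and of splitting the rows into shuffled mini-batches. The function names, the variance threshold, the batch size, and the toy data are all assumptions made for illustration; the KNN step is not shown.

```python
import random
import statistics


def drop_low_variance_columns(rows, min_variance):
    """Keep only the columns whose population variance exceeds min_variance."""
    columns = list(zip(*rows))
    keep = [j for j, col in enumerate(columns)
            if statistics.pvariance(col) > min_variance]
    reduced = [[row[j] for j in keep] for row in rows]
    return reduced, keep


def minibatch_indices(n_rows, batch_size, rng):
    """Yield shuffled row indices in chunks of at most batch_size."""
    order = list(range(n_rows))
    rng.shuffle(order)
    for start in range(0, n_rows, batch_size):
        yield order[start:start + batch_size]


# Toy stand-in for the pixel matrix: column 0 is constant (like a border pixel
# that is always blank), column 1 barely changes, column 2 varies a lot.
toy = [
    [0, 1, 0],
    [0, 1, 255],
    [0, 2, 128],
    [0, 1, 64],
]

reduced, kept = drop_low_variance_columns(toy, min_variance=1.0)
print(kept)     # indices of the columns that survive the filter
print(reduced)  # the matrix restricted to those columns

# Each gradient step would then touch only one small batch of rows.
batches = list(minibatch_indices(len(reduced), batch_size=3, rng=random.Random(0)))
print([len(batch) for batch in batches])
```

Both steps shrink the matrix product done per update. With a hypothetical batch of 100 rows and all 784 columns, for example, one product against 100 hidden units needs 100 x 784 x 100 = 7,840,000 multiplications, and dropping columns lowers that further.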

Image attributions: blmiers2, CC; Cheeckyneedle, CC
