Last Updated: February 25, 2016
· eigenhector

Apache Spark vs MapReduce for Machine Learning

https://www.linkedin.com/today/post/article/20140517214651-1937667-apache-spark-vs-mapreduce-for-machine-learning

I recently began using Apache Spark to build a machine learning pipeline, and I must say it is very nice to work with. The large-scale recommender systems I built in the past for YouTube and Google Play ran on top of Google's MapReduce framework. MapReduce is great for a lot of large-scale data processing, but it has drawbacks; in particular, it is a poor fit for iterative jobs such as machine learning. Most machine learning algorithms have a data-parallel phase, such as computing the gradient over a large data set, followed by a step where the gradients are summed on a single machine and the updated model is broadcast back to the workers for the next round of gradient computation. In most MapReduce implementations this step is slow because the model is written to disk and the workers reload the training data on every iteration.
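To make that pattern concrete, here is a toy, single-process Python sketch of the loop: a per-partition gradient (the data-parallel phase), a sum in one place, and an updated model handed back for the next round. It is only an illustration, not how MapReduce or Spark are implemented; the function names, the one-parameter model y ≈ w·x, and the data are all made up for the example.

```python
# Toy, single-process sketch of an iterative, data-parallel training loop.
# Everything here (names, model, data) is hypothetical and for illustration only.

def partial_gradient(w, partition):
    """Map step: squared-error gradient for y ~ w * x over one partition."""
    return sum(2.0 * (w * x - y) * x for x, y in partition)

def train(partitions, w=0.0, lr=0.01, iterations=50):
    n = sum(len(p) for p in partitions)
    for _ in range(iterations):
        # Data-parallel phase: each "worker" computes a gradient on its shard.
        grads = [partial_gradient(w, p) for p in partitions]
        # Reduce phase: the partial gradients are summed in one place.
        total = sum(grads)
        # Model update; the new w is what would be broadcast to the workers.
        w -= lr * total / n
    return w

# Toy data generated from y = 3x, split into two "worker" partitions.
partitions = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = train(partitions)
```

With this toy data the learned w should end up close to 3. The point is the shape of the loop: the same data is visited on every iteration, and only the small model changes between rounds.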

Spark's data processing model, on the other hand, lets users load the data once and cache it in RAM. The first iteration is very much like MapReduce: the training data is loaded over the network onto the worker machines, and the gradients are computed and then summed on a single machine, the reducer. What happens after that step is where Spark differs from MapReduce: the updated model is sent back to the workers over the network (as opposed to being saved to disk), and the training data is already loaded on the workers, which speeds up subsequent iterations tremendously!
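The difference can be sketched in plain Python by counting how often the training data is read. This is a simplified stand-in, not Spark code: `ToyStore` is a hypothetical class that plays the role of slow storage, and the two training functions are made-up names for the "reload every iteration" and "load once and cache" styles. In Spark itself, the caching side corresponds roughly to calling `cache()` on the training dataset.

```python
# Toy comparison: re-reading the data every iteration vs. caching it once.
# ToyStore and both train_* functions are hypothetical, for illustration only.

class ToyStore:
    """Stands in for slow storage (disk or network); counts full reads."""
    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self.rows)

def gradient(w, rows):
    """Mean squared-error gradient for the one-parameter model y ~ w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in rows) / len(rows)

def train_reloading(store, iterations, lr=0.01):
    """MapReduce-style: every iteration re-reads the training data."""
    w = 0.0
    for _ in range(iterations):
        rows = store.load()
        w -= lr * gradient(w, rows)
    return w

def train_cached(store, iterations, lr=0.01):
    """Spark-style: read once, keep the data in memory, update only the model."""
    rows = store.load()
    w = 0.0
    for _ in range(iterations):
        w -= lr * gradient(w, rows)
    return w

rows = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
reloading_store, cached_store = ToyStore(rows), ToyStore(rows)
w_reloading = train_reloading(reloading_store, iterations=20)
w_cached = train_cached(cached_store, iterations=20)
```

Both versions do the same arithmetic and arrive at the same model; the only difference is that the reloading version reads the data 20 times and the cached version reads it once. On a real cluster each read is a trip to disk or across the network, which is where the savings come from.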

I really like the data processing framework of Spark. Technology in the open source world has certainly changed a lot since I responded to the "Is Google Infrastructure Obsolete" question on Quora!