Why Spark Isn't Going to Dethrone Hadoop Anytime Soon
Once upon a time, Yahoo! needed a way to scale out storage, along with the ability to parallelize tasks, without emptying the king’s coffers. The king’s men ended up developing HDFS and MapReduce. This was the beginning of Hadoop, and for quite some time, it appeared as if Hadoop would be THE platform for managing big data. Vendors came to pay homage to the king. Open source projects ran wild. The future was good.
Along came Spark. An Apache project, Spark emerged as a framework for general data analytics on distributed computing clusters, including Hadoop clusters. It performs in-memory computations far faster than MapReduce can, and it handles structured data as well as streaming data (such as the Twitter Firehose). Now, instead of crowning Hadoop as the unquestioned prince of the land, many are calling for his head, claiming that Spark is the way of the future.
What are the loyal subjects, desperate for big data analytical solutions, to do? Will Hadoop rule supreme, or will Spark dethrone the crown prince and abscond with the royal jewels?
All the King’s Horses and All the King’s Men
The secret is, there’s room for both. In fact, the two play together quite nicely when not pitted against one another in a javelin match. While MapReduce on Hadoop lays claim to a powerful horse, Spark owns the faster of the beasts. Hadoop typically takes at least a few minutes, and often several hours, to complete a computational task. Spark can be used for real-time streaming and quick queries that complete within seconds. In fact, Spark often runs on Hadoop: Hadoop is a general-purpose framework fully capable of hosting the deep, powerful, slow MapReduce tasks as well as the speedy Spark jobs.
If Spark is quicker, why not just abandon MapReduce altogether? Spark is a RAM hog. It takes a lot of machine to keep up with Spark’s in-memory workloads, whereas MapReduce on Hadoop is much leaner. Plus, Spark does not have its own distributed file system; it requires a third-party solution for that. So, many big data projects end up installing Spark right on top of Hadoop, so that Spark can contribute advanced analytics while leveraging the Hadoop Distributed File System (HDFS).
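In practice, "Spark on top of Hadoop" often just means submitting a Spark job to Hadoop's YARN scheduler and pointing it at data stored in HDFS. A minimal sketch of that deployment (the script name and HDFS path here are hypothetical placeholders):

```shell
# Sketch: run a Spark job on an existing Hadoop cluster.
# YARN schedules the Spark executors; HDFS serves the input data.
# "my_job.py" and "hdfs:///data/events" are made-up example names.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_job.py \
  hdfs:///data/events
```

The key point is that Spark brings only the compute engine; storage (HDFS) and cluster scheduling (YARN) are borrowed from the Hadoop stack it runs on.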
A Consolidated Solution
Spark holds data in memory during processing, while Hadoop keeps the data on disk. Hadoop achieves fault tolerance via replication, while Spark relies on RDDs (Resilient Distributed Datasets). Each RDD carries enough lineage information to reconstruct any lost partition, so there is no need to replicate data in order to achieve fault tolerance. Spark is ideal for machine learning, data mining, faster data warehousing, stream processing, log processing, fraud detection, sensor data processing, and other jobs that demand fast turnaround.
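The lineage idea can be sketched in plain Python. This is a toy model, not Spark's actual API: each dataset remembers its parent and the transformation that produced it, so a lost partition can be recomputed rather than restored from a replica.

```python
# Toy model of RDD-style lineage-based fault tolerance.
# Illustrative sketch only; names and structure are not Spark's real API.

class ToyRDD:
    def __init__(self, partitions, parent=None, transform=None):
        self.partitions = partitions   # list of lists holding the data
        self.parent = parent           # lineage: where the data came from
        self.transform = transform     # lineage: how it was derived

    def map(self, fn):
        # Apply fn, and record the lineage instead of replicating data.
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return ToyRDD(new_parts, parent=self, transform=fn)

    def recompute_partition(self, i):
        # Rebuild a lost partition from the parent's data plus the
        # recorded transformation -- no replica needed.
        return [self.transform(x) for x in self.parent.partitions[i]]

base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)

doubled.partitions[1] = None                          # simulate a lost partition
doubled.partitions[1] = doubled.recompute_partition(1)
print(doubled.partitions)                             # [[2, 4], [6, 8]]
```

Real RDDs track a whole graph of transformations this way, which is why Spark can skip the disk-level replication Hadoop depends on.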
Together, they create a complete solution, and neither is slated to be banished from the kingdom anytime in the foreseeable future. Would you like to see how other real-world businesses settled their Hadoop versus Spark debates? See our customer stories. Whatever you decide, Bigstep has a solution for you in the Full Metal cloud.