Why Spark Isn't Going to Dethrone Hadoop Anytime Soon
Once upon a time, Yahoo! needed a way to scale out storage, along with the ability to parallelize tasks, without emptying the king’s coffers. The king’s men ended up developing HDFS and MapReduce. This was the beginning of Hadoop, and for quite some time, it appeared as if Hadoop would be THE platform for managing big data. Vendors came to pay homage to the king. Open source projects ran wild. The future was good.
Along came Spark. An Apache project, Spark emerged as a framework for general data analytics on distributed computing clusters, including Hadoop clusters. It performs in-memory computations far faster than MapReduce can, and it handles structured data as well as streaming data (such as the Twitter Firehose). Now, instead of crowning Hadoop as the unquestioned prince of the land, many are calling for his head, claiming that Spark is the way of the future.
What are the loyal subjects, desperate for big data analytical solutions, to do? Will Hadoop rule supreme, or will Spark dethrone the crown prince and abscond with the royal jewels?
All the King’s Horses and All the King’s Men
The secret is, there’s room for both. In fact, the two play together quite nicely when not pitted against one another in a javelin match. While MapReduce on Hadoop lays claim to a powerful horse, Spark owns the faster of the beasts. Hadoop typically takes at least a few minutes, and often several hours, to complete a computational task. Spark can be used for real-time streaming and quick queries that complete within seconds. In fact, Spark often runs on Hadoop: Hadoop is a general-purpose framework fully capable of hosting the deep, powerful, slow MapReduce tasks as well as the speedy Spark jobs.
If Spark is quicker, why not just abandon MapReduce altogether? Spark is a RAM hog. It takes a lot of machine to keep up with Spark’s in-memory workloads, whereas MapReduce on Hadoop is much leaner. Plus, Spark does not have its own distributed file system; it requires a third-party solution for that. So, many big data projects end up installing Spark right on top of Hadoop, so that Spark can contribute advanced analytics while leveraging the Hadoop Distributed File System (HDFS).
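In practice, "Spark on top of Hadoop" often just means submitting a Spark job to Hadoop's YARN scheduler and pointing it at data stored in HDFS. A minimal sketch of that deployment (the script name and HDFS path here are hypothetical placeholders):

```shell
# Sketch: run a Spark job on an existing Hadoop cluster.
# YARN schedules the Spark executors; HDFS serves the input data.
# "my_job.py" and "hdfs:///data/events" are made-up example names.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_job.py \
  hdfs:///data/events
```

The key point is that Spark brings only the compute engine; storage (HDFS) and cluster scheduling (YARN) are borrowed from the Hadoop stack it runs on.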
A Consolidated Solution
Spark holds data in memory during processing, while Hadoop keeps the data on disk. Hadoop achieves fault tolerance via replication, while Spark relies on RDDs (Resilient Distributed Datasets). Each RDD carries enough lineage information to reconstruct any lost partition, so there is no need to replicate data in order to achieve fault tolerance. Spark is ideal for machine learning, data mining, faster data warehousing, stream processing, log processing, fraud detection, sensor data processing, and other jobs that demand fast turnaround.
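The lineage idea can be sketched in plain Python. This is a toy model, not Spark's actual API: each dataset remembers its parent and the transformation that produced it, so a lost partition can be recomputed rather than restored from a replica.

```python
# Toy model of RDD-style lineage-based fault tolerance.
# Illustrative sketch only; names and structure are not Spark's real API.

class ToyRDD:
    def __init__(self, partitions, parent=None, transform=None):
        self.partitions = partitions   # list of lists holding the data
        self.parent = parent           # lineage: where the data came from
        self.transform = transform     # lineage: how it was derived

    def map(self, fn):
        # Apply fn, and record the lineage instead of replicating data.
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return ToyRDD(new_parts, parent=self, transform=fn)

    def recompute_partition(self, i):
        # Rebuild a lost partition from the parent's data plus the
        # recorded transformation -- no replica needed.
        return [self.transform(x) for x in self.parent.partitions[i]]

base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)

doubled.partitions[1] = None                          # simulate a lost partition
doubled.partitions[1] = doubled.recompute_partition(1)
print(doubled.partitions)                             # [[2, 4], [6, 8]]
```

Real RDDs track a whole graph of transformations this way, which is why Spark can skip the disk-level replication Hadoop depends on.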
Together, they create a complete solution, and neither is slated to be banished from the kingdom anytime in the foreseeable future. Would you like to see how other real-world businesses settled their Hadoop versus Spark debates? See our customer stories. Whatever you decide, Bigstep has a solution for you in the Full Metal cloud.