5 Useful Things to Know Before Choosing Between Hadoop & Spark
There are a lot of “Hadoop versus Spark” articles and blog posts circulating in the tech press, but in reality it doesn’t have to be a competition. The two platforms coexist nicely, and they’re good at entirely different things. Ironically, this rivalry, whether real or merely perceived, was born out of an industry need.
One of Hadoop’s creators gave a lecture at the University of California, Berkeley (Hadoop itself had been handed over to Apache by then). During his talk, he discussed some of the shortcomings of the Hadoop platform, and out of that discussion students created Spark specifically to address those gaps.
However, there’s no need to build an overly complex data infrastructure; if you only need one of the two, by all means stick with it. Here are the most important things to know before making your decision.
1. You Don’t Actually Have to Choose, You Know
Hadoop is a distributed infrastructure for data storage. It is built to spread humongous quantities of data across many nodes in a cluster of commodity servers, which eliminates the need to buy and maintain custom hardware, the most expensive and troublesome part of owning and managing lots of data. Hadoop also has means for indexing and tracking data, and the latest versions have more processing and analytical capabilities than the earlier editions.
Conversely, Spark is a data processing framework that works on distributed storage, but is not itself a distributed storage solution. So, you can use both. Spark can run across the Hadoop infrastructure, allowing you to leverage the real-time processing of Spark together with the batch processing Hadoop is known for.
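As a sketch of what “Spark on Hadoop” looks like in practice, a Spark application can be submitted to a Hadoop cluster through YARN, Hadoop’s resource manager (the script name and HDFS path here are hypothetical placeholders, not from the article):

```shell
# Submit a Spark application to a YARN-managed Hadoop cluster.
# --master yarn hands scheduling to Hadoop's resource manager;
# the argument points at input data already stored in HDFS.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_analysis.py hdfs:///data/transactions/
```

In this setup Hadoop provides the storage (HDFS) and the cluster resources, while Spark does the processing.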
2. They Also Work Well Separately, Thank You Very Much
But it isn’t necessary to run Spark on the Hadoop infrastructure. Hadoop includes HDFS, the Hadoop Distributed File System, as its storage layer, and it also includes MapReduce, its native data processing engine. Spark, by contrast, doesn’t have a file management system of its own, so it has to be paired with one; that can be HDFS, but Spark works just as well with a cloud-based data platform.
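To make the contrast concrete, MapReduce’s batch model can be sketched in plain Python: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This is a toy word count illustrating the model, not Hadoop’s actual Java API:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the list of values for each key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big cluster", "big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"big": 3, "data": 2, "cluster": 1}
```

Real MapReduce writes intermediate results to disk between these phases, which is a big part of why it is slower than Spark’s in-memory approach.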
3. Need Speed? Go Spark
As we mentioned, Spark is the speedier of the two. In fact, that’s one of the primary shortcomings of Hadoop that Spark’s creators at Berkeley sought to correct. Spark is used primarily for real-time processing, such as handling online financial transactions, while Hadoop is a much slower batch processing tool. The speed difference comes down to the fundamental design of these tools: Hadoop was never designed to be the speed demon that Spark was. Depending on the workload, Spark is commonly cited as roughly 10 times faster than MapReduce on Hadoop, and up to 100 times faster when the data fits in memory.
4. Not Everybody Needs the Speed, Maverick
While many applications demand that speed, it isn’t always necessary. For general data queries and reporting purposes, Hadoop’s slow-munching batch processing is fine. Spark’s speedier analytics are only necessary for things like streaming real-time data from manufacturing processes, website transactions, threat detection, and machine-learning applications.
5. Hadoop or Spark or Both: Stick ’em in the Cloud
Both Hadoop and Spark can run on a cloud platform. Spark as a Service is available, as is a cloud-based Hadoop infrastructure. This makes setting up and managing a big data infrastructure immensely easier and cheaper than doing so onsite. See how Bigstep’s customers have made it work and what their big data use cases are when you read our customer stories.
You can do big data, and Bigstep can help.