October 10, 2016

2 Insanely Clever Tips & Tricks for Running Spark on Hadoop

After some interesting industry banter about whether Hadoop or Spark would inevitably rule the universe of big data analytics, it's decided. There's room for both, and in many situations, both are needed. Spark and Hadoop work better together. So, now all you need is some valuable insight for running Spark on Hadoop for the total package in data analytics. You're welcome.

1. Pick the Right Execution Mode for Your Workload

Hadoop and Spark together are quite powerful. Unfortunately, we haven’t gotten to the point that using them is as easy as pressing a button.

There are three modes to choose from once Spark is set up on your Hadoop cluster.

• Local mode is used to launch one Spark shell with all of the related components in the same JVM. This mode is ideal for debugging operations.
• YARN-cluster mode runs the Spark driver inside the Hadoop cluster, within the YARN Application Master, and spins up the Spark executors inside YARN containers. Spark apps run entirely inside the cluster and stay decoupled from the workbench, which is used only to submit jobs.
• YARN-client mode runs the Spark driver on the workbench itself. Here the YARN Application Master serves in a reduced capacity: it only requests resources from YARN, and the Spark executors still run inside YARN containers on the Hadoop cluster.

The local and YARN-client modes both launch a Spark shell in which the user can run Scala commands and code interactively, which makes them ideal for initial and experimental deployments. The two differ in how they use computing power: local mode limits Spark to the single machine it runs on, while YARN-client mode takes commands from the Spark shell but leaves the computing work to Spark executors, which run on nodes within the Hadoop cluster. Once the workflow settles down, it is typically consolidated into a Scala program that can be submitted in YARN-cluster mode.
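As a rough sketch, the three modes map onto command lines like the ones built below. The `--master`, `--deploy-mode`, and `--class` flags are real `spark-shell` and `spark-submit` options in Spark 2.x (Spark 1.x also accepted `yarn-client` and `yarn-cluster` as `--master` values). The helper name `launch_command`, the jar name, and the class name are placeholders invented for illustration, not something taken from a real project.

```python
# Sketch: build the command line for each execution mode.
# launch_command, my-app.jar and com.example.MyApp are hypothetical names.

def launch_command(mode, app_jar="my-app.jar", main_class="com.example.MyApp"):
    """Return the argv list that would start Spark in the given mode."""
    if mode == "local":
        # Driver and executors share one JVM on this machine.
        return ["spark-shell", "--master", "local[*]"]
    if mode == "yarn-client":
        # Interactive shell: the driver runs here, executors run in YARN containers.
        return ["spark-shell", "--master", "yarn", "--deploy-mode", "client"]
    if mode == "yarn-cluster":
        # No interactive shell: a packaged application is submitted and the
        # driver runs inside the YARN Application Master.
        return ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
                "--class", main_class, app_jar]
    raise ValueError("unknown mode: " + mode)


if __name__ == "__main__":
    for mode in ("local", "yarn-client", "yarn-cluster"):
        print(mode + ": " + " ".join(launch_command(mode)))
```

The two shell-based modes differ only in the `--master` and `--deploy-mode` values, which is why moving an experiment from a laptop to the cluster is mostly a matter of changing those flags.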

2. Pick the Right Persistence Option for Your Workload

The various modes allow you to determine how resources are distributed within the environment, based on the workloads you need to run.

There are several persistence options to choose from, as well.

• Memory only is the default. If there isn’t enough memory available, some partitions won’t be cached and will instead be recomputed on demand each time they are needed.
• Memory and disk keeps the partitions that don’t fit in memory on disk and reads them from there when they are needed.
• Serialized memory only stores RDD partitions as serialized Java objects. It is more space-efficient, but costs more CPU because partitions must be deserialized when they are read.
• Disk only stores RDD partitions only on disk.
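For reference, the four options above correspond to constants on Spark's `StorageLevel` class. The small lookup below is only an illustrative sketch: the constant names and the memory, disk, and deserialized flags follow Spark's `StorageLevel` definitions, while the `Level` tuple, the `PERSISTENCE_OPTIONS` table, and the `describe` helper are names made up for this example.

```python
from collections import namedtuple

# Flags mirror the useMemory, useDisk and deserialized fields of Spark's StorageLevel.
# Level, PERSISTENCE_OPTIONS and describe are hypothetical names for this sketch.
Level = namedtuple("Level", ["constant", "use_memory", "use_disk", "deserialized"])

PERSISTENCE_OPTIONS = {
    "memory only": Level("MEMORY_ONLY", True, False, True),
    "memory and disk": Level("MEMORY_AND_DISK", True, True, True),
    "serialized memory only": Level("MEMORY_ONLY_SER", True, False, False),
    "disk only": Level("DISK_ONLY", False, True, False),
}


def describe(option):
    """Summarize where and in what form an option keeps RDD partitions."""
    level = PERSISTENCE_OPTIONS[option]
    form = "deserialized objects" if level.deserialized else "serialized bytes"
    return "StorageLevel.{0} keeps {1} (memory: {2}, disk: {3})".format(
        level.constant, form, level.use_memory, level.use_disk)


if __name__ == "__main__":
    for name in PERSISTENCE_OPTIONS:
        print(describe(name))
```

In a Scala job, the chosen constant would be passed to `rdd.persist(...)`, for example `rdd.persist(StorageLevel.MEMORY_AND_DISK)`.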

These options all involve some sort of trade-off among memory use, network resources, and CPU utilization. With the ‘memory and disk’ option, a Spark app that runs short of memory might see less than optimal performance, because RDD partitions that aren’t held in memory have to be found and retrieved from persistent storage. Likewise, the options that store serialized data (‘serialized memory only’ and ‘disk only’) take up less space and might give you better transfer times, but serializing and deserializing adds CPU overhead. Use this guide to find the ideal balance for your workloads.
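One way to reason about the trade-off is a small decision helper like the sketch below. The rules and their ordering are assumptions made for illustration, not an official Spark algorithm, and `suggest_storage_level` and its parameters are hypothetical names. The sizes are rough estimates you would supply yourself.

```python
# Sketch of a decision heuristic; the rules are illustrative assumptions,
# not something Spark applies automatically.

def suggest_storage_level(raw_gb, serialized_gb, cache_gb, recompute_is_expensive):
    """Suggest a StorageLevel name from rough size estimates.

    raw_gb: estimated size of the RDD as deserialized objects
    serialized_gb: estimated size of the RDD once serialized
    cache_gb: memory the cluster can spare for caching
    recompute_is_expensive: True if rebuilding a lost partition is costly
    """
    if raw_gb <= cache_gb:
        # Everything fits as plain objects: cheapest option for the CPU.
        return "MEMORY_ONLY"
    if serialized_gb <= cache_gb:
        # Fits only when serialized: trade CPU time for space.
        return "MEMORY_ONLY_SER"
    if not recompute_is_expensive:
        # Partitions that don't fit are simply recomputed on demand.
        return "MEMORY_ONLY"
    if cache_gb == 0:
        # No memory to spare and recomputation is costly: keep it all on disk.
        return "DISK_ONLY"
    # Keep what fits in memory and spill the rest to disk.
    return "MEMORY_AND_DISK"


if __name__ == "__main__":
    print(suggest_storage_level(100, 40, 16, True))
```

Treat the result as a starting point and measure the actual job, since real partition sizes and recomputation costs are hard to estimate up front.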
