2 Insanely Clever Tips & Tricks for Running Spark on Hadoop
After some interesting industry banter about whether Hadoop or Spark would inevitably rule the universe of big data analytics, it’s decided. There’s room for both, and in many situations, both are needed. Spark and Hadoop work better together. So, now all you need is some valuable insight for running Spark on Hadoop for the total package in data analytics. You’re welcome.
1. Pick the Right Execution Mode for Your Workload
There are three modes to choose from once you get Spark loaded into Hadoop.
• Local mode is used to launch one Spark shell with all of the related components in the same JVM. This mode is ideal for debugging operations.
• YARN-cluster mode runs the Spark driver inside the Hadoop cluster, within the YARN Application Master. The Spark executors are spun up inside YARN containers. This mode lets Spark apps run inside the cluster yet remain totally decoupled from the workbench, which is really just used to submit jobs.
• YARN-client mode means the Spark driver runs on the workbench itself. In this mode, the YARN Application Master serves in a reduced capacity: it only requests resources from YARN, so the Spark executors still run inside YARN containers within the Hadoop cluster.
The local and YARN-client modes both launch a Spark shell so that the user can run Scala commands and code interactively. For initial and experimental deployments, these modes are ideal. The two differ in how they use computing power: local mode limits Spark to the single machine it was launched on, while YARN-client mode takes commands from the Spark shell but leaves the computing work to Spark executors, which run on nodes within the Hadoop cluster. Eventually, the interactive workflow is usually consolidated into a Scala program.
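To make the three modes concrete, here is a minimal sketch of how each one might map onto a spark-submit command line. The `--master` and `--deploy-mode` flags are real spark-submit options. The helper name `spark_submit_args`, the mode labels, and the `my_app.jar` path are hypothetical and exist only for illustration. The sketch assumes Spark 2.0 or later, where the YARN modes are spelled `--master yarn` plus `--deploy-mode`. Older releases accepted `yarn-client` and `yarn-cluster` directly as master values.

```python
# Hypothetical helper: builds the argument list for a spark-submit call.
# --master and --deploy-mode are real spark-submit flags; the function
# name, the mode labels, and the application path are illustrative only.

def spark_submit_args(mode, app, local_threads="*"):
    """Return the spark-submit argument list for one of the three modes."""
    if mode == "local":
        # Driver and executors share a single JVM on this machine.
        master = ["--master", f"local[{local_threads}]"]
    elif mode == "yarn-client":
        # Driver stays on the workbench; executors run in YARN containers.
        master = ["--master", "yarn", "--deploy-mode", "client"]
    elif mode == "yarn-cluster":
        # Driver runs inside the cluster, in the YARN Application Master.
        master = ["--master", "yarn", "--deploy-mode", "cluster"]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["spark-submit", *master, app]


if __name__ == "__main__":
    for mode in ("local", "yarn-client", "yarn-cluster"):
        print(" ".join(spark_submit_args(mode, "my_app.jar")))
```

Keeping the mode as a single parameter makes it easy to debug in local mode first and then move the same application to the cluster without touching anything else.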
2. Pick the Right Persistence Option for Your Workload
There are several persistence options to choose from, as well.
• Memory only is the default. With memory only, if there isn’t enough memory available, some partitions won’t be cached and will instead be recomputed on demand each time they are needed.
• Memory and disk keeps the partitions that don’t fit in memory on disk, and reads them from there when they are needed.
• Serialized memory only is a more efficient option in terms of saving space, but does require more overhead in terms of CPU. It stores RDD partitions as serialized Java objects.
• Disk only stores RDD partitions on disk only.
These options all involve some trade-off among memory use, network resources, and CPU utilization. With the ‘memory and disk’ option, if there isn’t enough memory available, the Spark app might see less than optimal performance while RDD partitions that aren’t held in memory are read back from disk. Likewise, if you’re using a serializer with ‘serialized memory only’ or ‘disk only’, you might get better transfer times, but there will be more CPU overhead. Use this guide to find the ideal balance for your workloads.
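As a rough illustration of that balance, the sketch below maps the four options above to the matching constant names in Spark's `StorageLevel` class (Scala/Java API) and wraps a simple rule of thumb around them. The mapping uses real constant names. The function `suggest_storage_level` and its three yes/no inputs are hypothetical, and the rule it encodes is an assumption made for illustration, not an official Spark algorithm.

```python
# Article option name -> Spark StorageLevel constant name (Scala/Java API).
STORAGE_LEVELS = {
    "memory only": "MEMORY_ONLY",
    "memory and disk": "MEMORY_AND_DISK",
    "serialized memory only": "MEMORY_ONLY_SER",
    "disk only": "DISK_ONLY",
}


def suggest_storage_level(fits_in_memory, fits_when_serialized, recompute_is_cheap):
    """Hypothetical rule of thumb for picking a persistence option."""
    if fits_in_memory:
        # Fastest option: deserialized objects, no extra CPU work.
        return STORAGE_LEVELS["memory only"]
    if fits_when_serialized:
        # Trade CPU time for a smaller memory footprint.
        return STORAGE_LEVELS["serialized memory only"]
    if recompute_is_cheap:
        # Partitions that don't fit are simply recomputed on demand.
        return STORAGE_LEVELS["memory only"]
    # Recomputing is expensive, so spill the overflow to disk instead.
    return STORAGE_LEVELS["memory and disk"]


if __name__ == "__main__":
    print(suggest_storage_level(False, True, False))
```

In a Scala program, the suggested name would correspond to a call along the lines of `rdd.persist(StorageLevel.MEMORY_AND_DISK)`. The heuristic never returns `DISK_ONLY`, which is left as a deliberate choice for data sets that should not occupy memory at all.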