Technically Speaking

The Official Bigstep Blog

How to Choose the Right Programming Language for Your Big Data Initiatives

So, you have big plans for big data. You’ve picked out a lovely infrastructure and it’s time to get started. But one question remains: which language will you inflict on, we mean, insist that your developers and data scientists use? The reigning champs these days are R, Python, Scala, SAS, the Hadoop languages (Pig, Hive, etc.), and of course, Java. At last count, a scant 12 percent of developers working with big data projects chose to use Java.

Almost half of all big data operations are driven by code programmed in R, while SAS commanded just over 36 percent, Python took 35 percent (down somewhat from the previous two years), and the others accounted for less than 10 percent of all big data endeavors. So, let’s focus on the movers and shakers: R, Python, Scala, and Java.


R is a statistician’s language, but data scientists can master it given time. They just have to get up to speed on MATLAB, SAS, or perhaps Octave first. R is powerful for data analytics, but it isn’t as strong as a general-purpose language. For example, you can put together a good model using R, but you would probably end up translating it into Python or Scala before putting it into production anyway, so it might be best to write it in one of those languages to begin with. R isn’t practical for things like writing a clustering control system, because the debugging process would be nightmarish.


Most data scientists will already be intimately familiar with Python, so if you’re hiring them as part of your data initiative, this will be a natural fit. If you’re home-growing a development team to handle your big data operations, Python is relatively easy to learn (it’s just another object-oriented language for developers to master) and also has the distinct advantage of being easy for humans to read. On the flip side, while most big data processing frameworks do support Python, it’s somewhat of the redheaded stepchild of big data languages. This means that the fancy new features in products like Apache Spark might be offered in Scala or Java first, while the Python crowd has to wait out a few version updates to get their hands on them. If it isn’t crucial to always be the first to have all the snazzy bells and whistles, then Python is likely your kind of gig.


Scala is part of the JVM (Java Virtual Machine) ecosystem, which makes it potent and flexible out of the gate. It’s also an admirable pairing of functional and object-oriented programming, and is tremendously popular in the financial industry, where companies need to be able to work with massive sets of widely distributed data (think social media-level volumes and distribution). Kafka and Spark are both built largely in Scala, and you can do lots more with far less code in Scala than in Java. As is so often the case, one of Scala’s strengths is also its greatest weakness. This language provides multiple ways to do the same things, which means that you have to employ strong practices and guidelines to keep your Scala code from getting out of hand. At first glance, poorly done Scala code can resemble a really nasty section of Perl. Some developers also complain that Scala is a tad on the slow side.


Just a dozen or so lines of Scala code can easily balloon into a couple hundred lines of Java code. However, the latest version of Java has gone a long way toward improving that. Java won’t ever be as lean and mean as Scala, but it has other advantages to consider, like its natural habitats in Hadoop and other big data frameworks and tools.

Java takes some serious PR hits, and if it were a cartoon character, it would probably be the Charlie Brown of big data development languages. But when it comes to products like MapReduce, HDFS, Storm, Kafka, Spark, and Apache Beam (all part of the JVM ecosystem), Java moves from the Sheldon Cooper of languages to become Johnny Depp. Beyond the huge collections of debugging tools, monitoring tools, libraries, and profilers that Java gives you access to, no language has been tested, revised, and proven like Java. The biggest problem with it is that it’s ridiculously verbose, and as of Java 8 there is no built-in REPL for iterative development, as there is with the others on this list. Java 8 has, however, come a long way in terms of verbosity, and while it will never be as streamlined as Scala, it is much better.
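To make the verbosity point a little more concrete, here is a minimal sketch of the same word count written twice in plain Java: once in the older loop-and-bookkeeping style, and once with the lambdas and streams that arrived in Java 8. The class name WordCount, the method names, and the sample sentences are made up for illustration; this is not code from Hadoop, Spark, or any other framework mentioned above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical example class, not part of any big data framework.
class WordCount {

    // Pre-Java 8 style: explicit loops and manual map bookkeeping.
    public static Map<String, Long> countClassic(List<String> lines) {
        Map<String, Long> counts = new TreeMap<String, Long>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // skip the empty token produced by leading whitespace
                }
                Long current = counts.get(word);
                counts.put(word, current == null ? 1L : current + 1L);
            }
        }
        return counts;
    }

    // Java 8 style: the same logic expressed as one stream pipeline.
    public static Map<String, Long> countStreams(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up sample input for illustration only.
        List<String> lines = Arrays.asList("big data big plans", "Big Data tools");
        System.out.println(countClassic(lines));
        System.out.println(countStreams(lines));
    }
}
```

Both methods return the same sorted map of word counts. The stream version is still wordier than the Scala equivalent would likely be, but it drops the nested loops and the null check, which is roughly the kind of improvement the Java 8 remark is getting at.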

So, which language should you choose? Well, that all depends. When it comes to hardcore analytics and working with obscure statistical methods, R is your man. When working with intense neural network processing across numerous GPUs, Python is an obvious choice. For a solid production streaming solution, Scala is less verbose, while Java is battle-tested and proven, so choose your weapon. Many IT departments come to the conclusion that a one-two punch of R plus Python is the answer.

Don’t nail down your big data solutions until you’ve seen what the Full Metal Cloud has to offer. See our products here.
