October 15, 2014

4 Lessons Learned from Yahoo's Massive Hadoop Cluster Setup

Yahoo! has become the largest user of Hadoop, with a cluster setup comprising 10,000 CPUs located in more than 40,000 servers with 4,500 nodes. This Hadoop cluster setup manages 455 petabytes of data for Yahoo!, which amounts to more than 1,820 times the amount of data held by the Library of Congress. An undertaking of this magnitude teaches a great deal, and here is what can be gleaned from Yahoo!'s transition to Hadoop clusters.
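To put the comparison in concrete terms, the two figures above imply a size for the Library of Congress collection. The short calculation below is only a sketch of that arithmetic: it assumes decimal units (1 petabyte = 1,000 terabytes), and the variable names are illustrative.

```python
# Figures quoted in the article.
yahoo_data_pb = 455          # petabytes managed by Yahoo!'s Hadoop setup
library_multiple = 1_820     # "more than 1,820 times" the Library of Congress

# Assumption: decimal units, 1 PB = 1,000 TB.
TB_PER_PB = 1_000

# Implied upper bound on the Library of Congress figure used for the comparison.
implied_library_pb = yahoo_data_pb / library_multiple
implied_library_tb = implied_library_pb * TB_PER_PB

print(f"Implied Library of Congress size: {implied_library_tb:.0f} TB")
```

In other words, the comparison appears to rest on an estimate of roughly a quarter of a petabyte for the Library of Congress.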

When big data sets include both structured and unstructured data, Hadoop manages it efficiently.

1. Hadoop is Useful for Combining Different Types of Data

When an enterprise holds a wide variety of data types, as Yahoo! does, Hadoop is an excellent way to manage them. Hadoop easily handles structured data sets, like data within ERP systems, alongside semi-structured data sets like log files, and completely unstructured data like videos. Once the data is on a Hadoop cluster, it is easy to add to data sets and amend existing data, which is important for enterprises like Yahoo! that are constantly updating their body of data with new information on page views, click streams, photos, videos, and much more.

2. Hadoop is Useful for Sharing Data

Not only can Hadoop clusters hold these enormous mixed data sets, they also make it easy to house the data in a single place for collaboration. Yahoo! has a variety of data that needs to be accessible by large teams of workers and still remain relatively secure. Yahoo! uses YARN, the resource manager introduced in Hadoop 2 and sometimes referred to as MapReduce 2.0, to run a full range of jobs within those data sets. Currently, Yahoo! has 32,000 nodes within 16 clusters that run YARN.
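The kind of batch job such a cluster runs over shared data can be sketched in plain Python. This is only an illustration of the map, shuffle, and reduce pattern, not Hadoop's actual API and not Yahoo!'s code; the log format, the sample lines, and every function name are invented for the example.

```python
from collections import defaultdict

# Hypothetical click-stream log lines; the format is invented for illustration.
LOG_LINES = [
    "2014-10-15 /news view",
    "2014-10-15 /mail view",
    "2014-10-15 /news click",
    "2014-10-15 /news view",
]

def map_phase(line):
    """Emit (page, 1) for every page-view record; ignore other events."""
    _date, page, event = line.split()
    if event == "view":
        yield page, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the counts collected for one page."""
    return key, sum(values)

def run_job(lines):
    """Run map, shuffle, and reduce in sequence on a single machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

page_views = run_job(LOG_LINES)
print(page_views)
```

On a real cluster the map and reduce steps would run in parallel across many nodes, which is what makes the pattern practical for data sets of the size described here.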

3. Hadoop With YARN is Useful for Getting New Employees Started Immediately

According to Peter Cnudde, VP of Engineering for Yahoo!, new employees can come in and become productive immediately using Hadoop, especially when running YARN. As qualified talent becomes harder to find, this is an excellent way to stay productive when few candidates have experience working with data sets as enormous as those Yahoo! manages.

Yahoo! uses Hadoop for lots of things, but not as an email or photo server.

4. Hadoop Can’t Replace Some Server Operations

Yahoo! didn’t move all server processes to Hadoop clusters, however. Though it uses Hadoop to scan its email for spam, it does not use Hadoop as the email server. Likewise, Yahoo! uses Hadoop to run image recognition on Flickr photos, but does not use it as the photo server.

When is Hadoop not the best solution? A Hadoop cluster setup isn’t the best option for handling real-time analytics, because it’s simply not that fast. It also isn’t the best solution for smaller data sets, such as those measured in gigabytes, because there are more manageable and flexible ways to handle smaller data sets. Hadoop is also not the most secure way to store sensitive data, such as personal information on clients and consumers. It doesn’t manage overly complex queries well, and does not allow for functional interactive access to data. What Hadoop handles best is mining extremely large data sets, data exploration when full data sets are available, pre-processing enormous data sets, and allowing for data agility.
