4 Lessons Learned from Yahoo's Massive Hadoop Cluster Setup
Yahoo! has become the largest user of Hadoop, establishing a cluster setup comprised of 10,000 CPUs located in more than 40,000 servers with 4,500 nodes. This Hadoop cluster setup manages 455 petabytes of data for Yahoo!, which amounts to more than 1,820 times the amount of data held by the Library of Congress. Obviously, and undertaking of this magnitude leads to a lot of learning on the issue, and here is what can be gleaned from Yahoo!‘s transition to Hadoop clusters.
1. Hadoop is Useful for Combining Different Types of Data
When enterprises hold a wide variety of data types, such as structured and unstructured data sets like Yahoo! has, Hadoop is an excellent way to manage it. Hadoop easily handles structured data sets, like data within ERP systems, alongside semi-structured data sets like file logs, and completely unstructured data like videos. Once on a Hadoop cluster, it is easy to add to data sets and amend existing data, which is important for enterprises like Yahoo! that are constantly updating their body of data with new information on page views, click streams, photos, videos, and much more.
2. Hadoop is Useful for Sharing Data
Not only can Hadoop clusters hold these enormous mixed data sets, it is also an easy way to house it in a single place for collaboration. Yahoo! has a variety of data that needs to be accessible by large teams of workers and still remain relatively secure. Yahoo! uses YARN, formerly called MapReduce, to run a full range of jobs within those data sets. Currently, Yahoo! has 32,000 nodes within 16 clusters that run YARN.
3. Hadoop With YARN is Useful fir Getting New Employees Started Immediately
According to Peter Cnudde, VP of Engineering for Yahoo!, new employees can come in and become productive immediately using Hadoop, especially when running YARN. As it becomes harder and harder to find qualified talent, this is an excellent way to stay productive when there aren’t lots of candidates out there who have much experience working with enormous data sets like Yahoo! manages.
4. Hadoop Can’t Replace Some Server Operations
Yahoo! didn’t move all server processes to Hadoop clusters, however. Though they use Hadoop to scan their email for spam, they do not use it as the email server. Also, Yahoo! uses Hadoop to run image recognition on Flickr photos, but does not use it as their photo server.
When is Hadoop not the best solution? A Hadoop cluster setup isn’t the best option for handling real-time analytics, because it’s simply not that fast. It also isn’t the best solution for smaller data sets, such as those measured in gigabytes, because there are more manageable and flexible ways to handle smaller data sets. Hadoop is also not the most secure way to store sensitive data, such as personal information on clients and consumers. It doesn’t manage overly complex queries well, and does not allow for functional interactive access to data. What Hadoop handles best is mining extremely large data sets, data exploration when full data sets are available, pre-processing enormous data sets, and allowing for data agility.