Cloudera on Bare Metal Cloud

Key Features

Cloudera CDH has been downloaded by more enterprises than all other Hadoop distributions combined.

High-performance bare metal instances provide consistent computing performance
High-throughput distributed all-SSD storage that offers more IOPS than local storage
High-performance network: 40 Gbps per instance for throughput, cut-through switching for the lowest possible latency
Easy orchestration and scaling of clusters, at the click of a button or through the API

Deployment and Integration

We deploy the latest version of Cloudera’s distribution of Hadoop (CDH 5.x).

It takes two clicks in the Control Center or one API call to deploy CDH on the Bare Metal Cloud. Scaling is just as easy.
The distribution will expand to the entire size of the underlying Instance Arrays. Account owners get root access to the whole system and can access all the features of CDH. We also pre-configure CDH to connect with other apps in the Metal Cloud.
The most popular choices are Datameer for Analytics, and Exasol’s EXASolution for MPP distributed SQL.

Deployment Architecture

We split Hadoop services between two types of node clusters: Head Nodes and Data Nodes.
Both can be scaled horizontally or vertically at any time.

Head Nodes Instance Array

Holds administrative services and HDFS metadata.

Cloudera Manager
HDFS NameNode
HDFS Secondary NameNode
YARN ResourceManager
Zookeeper
Spark 2.1

Data Nodes Instance Array

Executes actual computing tasks.

HDFS DataNode
YARN NodeManager
HTTPFS
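
The split above can be written down as a simple mapping from Instance Array to the services it hosts. The sketch below only restates the two lists; the names `CLUSTER_LAYOUT` and `array_for_service` are illustrative and are not part of any Cloudera or Metal Cloud API.

```python
# Illustrative mapping of Instance Arrays to the CDH services listed above.
# The dictionary keys and the helper name are hypothetical.
CLUSTER_LAYOUT = {
    "head_nodes": [
        "Cloudera Manager",
        "HDFS NameNode",
        "HDFS Secondary NameNode",
        "YARN ResourceManager",
        "Zookeeper",
        "Spark 2.1",
    ],
    "data_nodes": [
        "HDFS DataNode",
        "YARN NodeManager",
        "HTTPFS",
    ],
}


def array_for_service(service):
    """Return the name of the Instance Array that hosts the given service."""
    for array_name, services in CLUSTER_LAYOUT.items():
        if service in services:
            return array_name
    raise KeyError(f"unknown service: {service}")


print(array_for_service("YARN NodeManager"))
```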

Actual interface elements from our Control Panel.
Each interface element groups servers with the same role within the Cloudera cluster (head nodes, data nodes) into an Instance Array.

Our default configuration uses different types of hardware configurations for the Head Nodes and Data Nodes.

Head Nodes

Instance type: FMCI 20.128
Solid State Drive size: 250 GB
OS: CentOS 7.x
Instance count: 1

Data Nodes

Instance type: FMCI 20.128
Solid State Drive size: 1 TB
OS: CentOS 7.x
Instance count: 3
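
As a rough illustration of what the default Data Nodes configuration provides, the sketch below estimates usable HDFS capacity from the figures above. The replication factor of 3 is the stock HDFS default (`dfs.replication`) and is assumed here, not confirmed for this deployment; the 25% of each drive reserved for non-HDFS use (OS, logs, temporary job data) is likewise an assumption, and the function name is hypothetical.

```python
def usable_hdfs_capacity_tb(instance_count, ssd_tb_per_instance,
                            replication_factor=3, reserved_fraction=0.25):
    """Estimate how much unique data HDFS can hold, in TB.

    replication_factor: assumed HDFS default of 3 copies per block.
    reserved_fraction: assumed share of each drive kept free for non-HDFS use.
    """
    raw_tb = instance_count * ssd_tb_per_instance
    available_tb = raw_tb * (1.0 - reserved_fraction)
    return available_tb / replication_factor


# Default Data Nodes Instance Array: 3 instances with a 1 TB SSD each.
default_capacity = usable_hdfs_capacity_tb(instance_count=3,
                                           ssd_tb_per_instance=1.0)
print(f"{default_capacity:.2f} TB")  # prints "0.75 TB"
```

Under these assumptions, the default three-node array holds roughly 0.75 TB of unique data; lowering the replication factor or the reserved share changes the estimate accordingly.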

You can always add additional services such as Hive or Impala, depending on your use case.

Sizing and Scaling Your Hadoop Cluster

Size your cluster using the table below; we provide some rules of thumb to help you get started. One of the most challenging problems in the Hadoop world is determining how much infrastructure you need from the start, as Hadoop’s performance profile is very application-specific.

Sizing the Head Nodes Instance Array

We recommend starting with an Instance Array of 1 or 2 instances. If you choose an Instance Array with 2 instances, high availability is configured and activated automatically.
Despite its name, the Secondary NameNode is not a failover node. It periodically merges the NameNode’s edit log into the filesystem image (checkpointing), which shortens NameNode startup time, and it is good practice to run it on a separate machine.

Sizing the Data Nodes Instance Array

The calculator below can be used to size the Data Nodes Instance Array. Keep in mind that sizing is very dependent on the use case, so we recommend benchmarking and adapting the cluster to the actual requirements.
Impala makes extensive use of memory. If you plan to use Impala, we recommend adding machines with more RAM and allocating enough memory to Impala that the majority of your hot data can be kept in RAM. The minimum recommended configuration is 128 GB of RAM.
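
A back-of-the-envelope version of such an estimate might look like the sketch below. It is not the vendor's sizing method: the replication factor of 3 (the HDFS default), the 25% of disk held in reserve, and the 50% of RAM handed to Impala are all assumptions to replace with measured values, and both function names are hypothetical.

```python
import math


def data_nodes_needed(dataset_tb, ssd_tb_per_instance,
                      replication_factor=3, reserved_fraction=0.25):
    """Estimate the number of Data Nodes needed to store dataset_tb of data.

    replication_factor and reserved_fraction are assumed values.
    """
    required_raw_tb = dataset_tb * replication_factor
    usable_per_node_tb = ssd_tb_per_instance * (1.0 - reserved_fraction)
    return max(1, math.ceil(required_raw_tb / usable_per_node_tb))


def hot_data_fits_in_ram(hot_data_gb, instance_count,
                         ram_gb_per_instance=128, impala_fraction=0.5):
    """Check whether the hot data set fits in the memory given to Impala.

    ram_gb_per_instance defaults to the 128 GB minimum mentioned above;
    impala_fraction is an assumed share of RAM reserved for Impala.
    """
    impala_ram_gb = instance_count * ram_gb_per_instance * impala_fraction
    return hot_data_gb <= impala_ram_gb


nodes = data_nodes_needed(dataset_tb=10, ssd_tb_per_instance=1.0)
print(nodes)                                                   # prints 40
print(hot_data_fits_in_ram(hot_data_gb=600, instance_count=10))  # prints True
```

Treat the result as a starting point for benchmarking, not as a final cluster size.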

How to Scale Hadoop on Metal Cloud

The bare metal cloud’s unique architecture enables easy orchestration and management of complex clusters, through the API or the Control Center. Scaling is as easy as using a visual slider in the Control Center or making an API call.

Scaling Horizontally

New instances can be added to the CDH cluster in moments.
The CDH deployment scales with the size of the underlying Instance Array. When you add an instance to the array, it is registered in Cloudera Manager and the services are started on it.

Scaling Vertically

The configuration of the instances in an array can be redefined at any time. The changes are applied after a simple restart. No additional delays are required for the data to be transferred out of each individual host. In two minutes you are up and running with more power than before.

HDFS Spill out into Metal Data Lake

Do you have colder data that you want to keep outside of your own CDH cluster, but still within reach? The Bigstep metal data lake is a fully managed HDFS solution you can use to store petabyte-scale data. You choose the number of replicas and only pay for the capacity stored.

Optimized for high throughput - The service is designed to handle high bandwidth.
Built for large files - There is no limit on file size.
Secure - Traffic is encrypted and authentication is done with Kerberos.
POSIX style permissions - Use chmod, chown and traditional ownership attributes.
Native support in many big data applications - Tools like Apache Spark, 0xdata’s H2O and Datameer can directly connect to the service and store compressed sequence files.
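
Because the data lake exposes POSIX-style permissions over HDFS, the standard `hdfs dfs -chown` and `hdfs dfs -chmod` commands should apply. The sketch below only assembles those command lines and prints them; the `hdfs://datalake.example/...` URI, the user, and the group are placeholders, and a valid Kerberos ticket (for example, one obtained with `kinit`) is assumed to exist before the commands are run.

```python
import shlex


def permission_commands(path, owner, group, mode):
    """Build hdfs CLI invocations that set ownership and mode on a path.

    mode is an octal string such as "750". Nothing is executed here.
    """
    if len(mode) not in (3, 4) or any(c not in "01234567" for c in mode):
        raise ValueError(f"mode must be a 3 or 4 digit octal string: {mode!r}")
    return [
        ["hdfs", "dfs", "-chown", f"{owner}:{group}", path],
        ["hdfs", "dfs", "-chmod", mode, path],
    ]


# Placeholder URI, user and group; substitute your own values.
commands = permission_commands(
    "hdfs://datalake.example/data/archive", "analyst", "analytics", "750"
)
for command in commands:
    print(shlex.join(command))
```

To apply the changes, each list can be passed to `subprocess.run(command, check=True)` on a machine where the `hdfs` client is installed and configured.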
