Key Features
Cloudera CDH was downloaded by more enterprises than all other such distributions combined.
- High-performance bare metal instances provide consistent computing performance
- High-throughput distributed all-SSD storage that offers more IOPS than local storage
- High-performance network: 40 Gbps per instance for throughput, with cut-through switching for the lowest possible latency
- Easy orchestration and scaling of clusters, at the click of a button or through the API
Deployment and Integration
We deploy the latest version of Cloudera’s distribution of Hadoop (CDH 5.x).
It takes two clicks in the Control Center or one API call to deploy CDH on the Bare Metal Cloud. Scaling is just as easy.
The distribution expands to use every instance in the underlying Instance Arrays. Account owners get root access to the whole system and can access all the features of CDH. We also pre-configure CDH to connect with other apps in the Metal Cloud.
The most popular choices are Datameer for analytics and Exasol’s EXASolution for MPP distributed SQL.
Deployment Architecture
We split Hadoop services between two types of node clusters: Head Nodes and Data Nodes.
Both can be scaled horizontally or vertically at any time.
- Cloudera Manager
- HDFS NameNode
- HDFS Secondary NameNode
- YARN ResourceManager
- Zookeeper
- Spark 2.1
- HDFS DataNode
- YARN NodeManager
- HTTPFS
Actual interface elements from our Control Panel.
Each interface element groups servers with the same role within the Cloudera cluster (head nodes, data nodes) into an Instance Array.
Our default configuration uses different hardware configurations for the Head Nodes and the Data Nodes.
Head Nodes
- Instance type: FMCI 20.128
- Solid Store Drive size: 250 GB
- OS: CentOS 7.x
- Instance count: 1
Data Nodes
- Instance type: FMCI 20.128
- Solid Store Drive size: 1 TB
- OS: CentOS 7.x
- Instance count: 3
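As a rough illustration of what the default layout provides, the sketch below estimates usable HDFS capacity from the Data Nodes array. It assumes 1 TB = 1000 GB, the HDFS default replication factor of 3, and a hypothetical 25% reserve for temporary and intermediate data; the class and function names are illustrative and not part of any Metal Cloud API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceArray:
    """Illustrative description of an Instance Array (names are hypothetical)."""
    name: str
    instance_count: int
    drive_size_gb: int


def usable_hdfs_gb(array: InstanceArray,
                   replication: int = 3,
                   reserve_fraction: float = 0.25) -> float:
    """Estimate usable HDFS capacity in GB.

    raw capacity = instances * drive size
    usable       = raw * (1 - reserve) / replication
    The reserve fraction is an assumption, not a documented value.
    """
    raw_gb = array.instance_count * array.drive_size_gb
    return raw_gb * (1 - reserve_fraction) / replication


# Default configuration taken from the lists above.
head_nodes = InstanceArray("head-nodes", instance_count=1, drive_size_gb=250)
data_nodes = InstanceArray("data-nodes", instance_count=3, drive_size_gb=1000)

if __name__ == "__main__":
    print(f"Usable HDFS capacity: {usable_hdfs_gb(data_nodes):.0f} GB")
```

Under these assumptions the three default Data Nodes hold roughly 750 GB of replicated data, which is why most production workloads grow the Data Nodes array early on.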
You can always add additional services such as Hive or Impala, depending on your use case.
Sizing and Scaling Your Hadoop Cluster
One of the most challenging problems in the Hadoop world is determining how much infrastructure you need from the start, because Hadoop’s performance profile is very application-specific. Size your cluster using the table below, which provides some rules of thumb to help you get started.
Sizing the Head Nodes Instance Array
- We recommend starting with an Instance Array of 1 or 2 instances. If you choose an Instance Array with 2 instances, high availability is configured and activated automatically.
- Despite its name, the Secondary NameNode is not a failover node. It periodically merges the NameNode's edit log into a checkpoint of the file system image, which shortens the NameNode's start-up time. It is good practice to run it on a separate machine.
Sizing the Data Nodes Instance Array
- The calculator below can be used to size the Data Nodes Instance Array. Keep in mind that sizing depends heavily on the use case, so we recommend benchmarking and adapting the cluster to your actual requirements.
- Impala makes extensive use of memory. If you plan to use Impala, we recommend adding machines with more RAM and allocating enough memory to Impala that the majority of your hot data can be kept in RAM. The minimum recommended configuration is 128 GB of RAM.
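The two rules of thumb above can be sketched as a small calculation. This is a minimal starting-point estimate, not a replacement for benchmarking: the 25% reserve for temporary data, the 60% share of RAM given to Impala, and the function names are all assumptions made for illustration. The 1 TB drive size, the 3-instance minimum and the 128 GB RAM figure come from the default configuration and recommendation above.

```python
import math


def data_nodes_needed(dataset_tb: float,
                      drive_tb_per_node: float = 1.0,
                      replication: int = 3,
                      reserve_fraction: float = 0.25,
                      min_nodes: int = 3) -> int:
    """Estimate how many Data Nodes are needed to store a dataset.

    Raw storage needed = dataset * replication / (1 - reserve).
    reserve_fraction is an assumed allowance for temporary data.
    """
    if dataset_tb < 0:
        raise ValueError("dataset_tb must be non-negative")
    raw_needed_tb = dataset_tb * replication / (1 - reserve_fraction)
    return max(min_nodes, math.ceil(raw_needed_tb / drive_tb_per_node))


def hot_data_fits_in_ram(hot_data_gb: float,
                         nodes: int,
                         ram_gb_per_node: int = 128,
                         impala_share: float = 0.6) -> bool:
    """Check whether the hot data set fits in the RAM set aside for Impala.

    impala_share is an assumed fraction of each node's RAM given to Impala.
    """
    return hot_data_gb <= nodes * ram_gb_per_node * impala_share


if __name__ == "__main__":
    nodes = data_nodes_needed(dataset_tb=5)
    print(f"Data Nodes needed for 5 TB: {nodes}")
    print(f"1 TB of hot data fits in RAM: {hot_data_fits_in_ram(1000, nodes)}")
```

If the memory check fails, the options are to add Data Nodes or to move to instances with more RAM.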
How to Scale Hadoop on Metal Cloud
The bare metal cloud’s unique architecture enables easy orchestration and management of complex clusters, through the API or the Control Center. Scaling is as easy as using a visual slider in the Control Center or making an API call.
Scaling Horizontally
New instances can be added to the CDH cluster in moments.
The CDH deployment scales with the size of the underlying Instance Array. When you add an instance to the array, it is registered in Cloudera Manager and the services for its role are started on it.
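One way to confirm that a newly added instance has registered is to compare the host list reported by the Cloudera Manager REST API before and after scaling. The sketch below assumes Cloudera Manager is reachable over plain HTTP on its default port 7180 with basic authentication, and that API version `v19` is available on your CDH 5.x release; the host name, credentials and API version are placeholders to adapt to your deployment.

```python
import base64
import json
from urllib.request import Request, urlopen


def hosts_url(cm_host: str, port: int = 7180, api_version: str = "v19") -> str:
    """Build the URL of the Cloudera Manager hosts endpoint.

    The port and API version are assumptions; check your installation.
    """
    return f"http://{cm_host}:{port}/api/{api_version}/hosts"


def hostnames_from_payload(payload: dict) -> set:
    """Extract host names from a hosts-endpoint response body."""
    return {item["hostname"] for item in payload.get("items", [])}


def fetch_hostnames(cm_host: str, user: str, password: str) -> set:
    """Query Cloudera Manager for the hosts it currently manages."""
    request = Request(hosts_url(cm_host))
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urlopen(request, timeout=30) as response:
        return hostnames_from_payload(json.load(response))


def newly_registered(before: set, after: set) -> list:
    """Return the hosts present after scaling that were absent before."""
    return sorted(after - before)
```

A typical use is to call `fetch_hostnames` once before scaling the Instance Array and once afterwards, then pass both results to `newly_registered`.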
Scaling Vertically
The configuration of the instances in an array can be redefined at any time. The changes are applied after a simple restart. Data does not have to be transferred off each individual host first, so there is no additional delay. In two minutes you are up and running with more power than before.
HDFS Spill-Out into the Metal Data Lake
Do you have colder data that you want to keep outside of your own CDH cluster, but still within reach? The Bigstep Metal Data Lake is a fully managed HDFS solution you can use to store petabyte-scale data. You choose the number of replicas and only pay for the capacity stored.
- Optimized for high throughput - The service is designed to handle high bandwidth.
- Built for large files - There is no limit on file size.
- Secure - Traffic is encrypted and authentication is done with Kerberos.
- POSIX style permissions - Use chmod, chown and traditional ownership attributes.
- Native support in many big data applications - Tools like Apache Spark, 0xdata’s H2O and Datameer can directly connect to the service and store compressed sequence files.
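The POSIX-style permissions mentioned above can be managed with the standard `hdfs dfs -chown` and `hdfs dfs -chmod` commands. The sketch below wraps them in small helpers; it assumes a Hadoop client is installed on the machine, that you already hold a valid Kerberos ticket (for example from `kinit`), and that the path you pass points at your own data lake location. The helper names and the example path are hypothetical.

```python
import re
import subprocess


def chown_cmd(path: str, owner: str, group: str) -> list:
    """Build the `hdfs dfs -chown owner:group path` command."""
    return ["hdfs", "dfs", "-chown", f"{owner}:{group}", path]


def chmod_cmd(path: str, mode: str) -> list:
    """Build the `hdfs dfs -chmod mode path` command for an octal mode."""
    if not re.fullmatch(r"[0-7]{3,4}", mode):
        raise ValueError(f"expected an octal mode such as '750', got {mode!r}")
    return ["hdfs", "dfs", "-chmod", mode, path]


def run(cmd: list) -> None:
    """Run a command and raise CalledProcessError if it fails."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical path; replace with a directory in your data lake.
    cold_dir = "/data/cold/2017"
    run(chown_cmd(cold_dir, "analytics", "hadoop"))
    run(chmod_cmd(cold_dir, "750"))
```

Building the argument list separately from running it keeps the helpers easy to check without a live cluster.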