Building Data Lakes in the Cloud
Understand how building a data lake in the cloud differs from building one on premises
Every industry has both proven and potential data lake use cases. With enterprise data warehouses (EDWs) being rendered ever more inefficient when facing new business needs, cloud-based data lakes have been gaining popularity with enterprises looking to cover the technology gap. Cloud data lakes are purpose-built to meet the data management requirements of the evolving enterprise landscape.
Here’s a brief walk-through of the steps required to build a data lake in the cloud and connect it to on-premises environments. It covers best practices for architecting cloud data lakes and key aspects such as performance, security, data lineage, and data maintenance. The technologies presented range from basic HDFS storage to real-time processing with Spark Streaming.
Why Build Data Lakes in the Cloud?
The main drivers for enterprise adoption of the data lake have been the need for agility and custom, enterprise-wide access to datasets, data streams, and data analysis tools. However, more and more companies have started using cloud data lakes as prototyping workbenches and have embraced the researcher mindset in order to build fully functional data laboratories in the cloud.
A cloud data lake is a convenient way to bypass the tedious integration and configuration of big data applications and the costly acquisition and tuning of the underlying on-premises infrastructure. It also offers the opportunity to experiment with an ever-growing array of big data technologies.
Solutions for securely extending the on-premises network in the cloud
To protect against unauthorized access, the data lake uses computer network authentication protocols, such as Kerberos, and it encrypts data both when transmitted across networks and while at rest.
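In a Hadoop-based lake (an assumption here, since the paragraph above does not name a specific stack), these controls usually map to a handful of configuration properties. The sketch below checks an already-parsed configuration for Kerberos authentication and encryption in transit. The function name `missing_security_settings` is hypothetical; the property names are standard Hadoop settings. Encryption at rest is configured separately (for example through HDFS encryption zones) and is not covered by this check.

```python
# Hadoop properties that, as an assumption for this sketch, stand in for the
# controls described above: Kerberos authentication and encryption in transit.
REQUIRED_SETTINGS = {
    "hadoop.security.authentication": "kerberos",  # Kerberos instead of "simple"
    "hadoop.rpc.protection": "privacy",            # encrypt RPC traffic
    "dfs.encrypt.data.transfer": "true",           # encrypt block data transfers
}


def missing_security_settings(config):
    """Return {property: (expected, actual)} for every setting that differs.

    `config` is a plain dict of property name to value, for example the
    merged contents of core-site.xml and hdfs-site.xml.
    """
    problems = {}
    for key, expected in REQUIRED_SETTINGS.items():
        actual = config.get(key)  # None when the property is absent
        if actual != expected:
            problems[key] = (expected, actual)
    return problems
```

An empty result means all three settings carry the expected values; anything else lists what to fix before the cluster is exposed to the extended network.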
Security measures suited to cloud data lakes must also cover efficient backup protocols. Ideally, replication is configured on a per-file basis so users can decide the extent to which the most sensitive data is safeguarded against loss.
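On HDFS, per-file replication can be set with the `hdfs dfs -setrep` command. As a minimal sketch, the helper below maps a sensitivity tier to a replication factor and builds the command without running it. The tier names, the factors, the example path, and the function name `setrep_command` are all assumptions made for illustration.

```python
# Hypothetical sensitivity tiers mapped to HDFS replication factors.
REPLICATION_BY_TIER = {"critical": 5, "standard": 3, "scratch": 1}


def setrep_command(path, tier):
    """Build the `hdfs dfs -setrep` command for one file or directory."""
    replicas = REPLICATION_BY_TIER[tier]  # raises KeyError for unknown tiers
    # -w asks the command to wait until the target replication is reached.
    return ["hdfs", "dfs", "-setrep", "-w", str(replicas), path]


if __name__ == "__main__":
    # Illustrative path only; print the command instead of executing it.
    print(" ".join(setrep_command("/lake/finance/ledger.parquet", "critical")))
```

Returning the command as a list keeps the sketch free of side effects; on a host with an HDFS client it could be passed to `subprocess.run`.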
Integration solutions for multiple Active Directory domains and multiple secure Hadoop environments
The data lake should easily integrate any corporate Active Directory (LDAP) or third-party authentication method. Identity services integration is crucial when building a data lake in the cloud. As data provisioning, management, and governance become easier and safer, cloud-based Hadoop architectures better mirror and seamlessly integrate with on-premises architectures.
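One concrete form this integration can take in Hadoop is LDAP group mapping, configured in `core-site.xml`. The sketch below renders those properties for a single directory server. The function name `ldap_group_mapping_xml` and all example values are hypothetical, and a multi-domain setup would need more than this single-server fragment.

```python
import xml.etree.ElementTree as ET


def ldap_group_mapping_xml(ldap_url, base_dn, bind_user):
    """Render core-site.xml properties for Hadoop's LDAP group mapping."""
    props = {
        "hadoop.security.group.mapping":
            "org.apache.hadoop.security.LdapGroupsMapping",
        "hadoop.security.group.mapping.ldap.url": ldap_url,
        "hadoop.security.group.mapping.ldap.base": base_dn,
        "hadoop.security.group.mapping.ldap.bind.user": bind_user,
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values; replace with the directory's real URL and DNs.
    print(ldap_group_mapping_xml(
        "ldaps://ad.example.com:636",
        "dc=example,dc=com",
        "cn=hadoop-bind,dc=example,dc=com",
    ))
```

The bind password is deliberately left out; it is better kept in a protected file or credential store than generated inline.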
Increase the performance of your data lake with the bare-metal cloud
Despite its impact on the IT landscape, the virtualized cloud is far from being the best underlying architecture solution for data lakes—and for big data projects in general. A fairly new breed of cloud, the bare-metal cloud, offers a much better environment in terms of performance, isolation, and flexibility. Platforms offering such environments provide the high computation power and security of bare metal with the full flexibility of the cloud.
Software solutions typically used for data lakes
From concept to deployment, creating a production-ready enterprise cloud data lake should take minutes, not months. Every hardware connection should be software-defined, and every software component should be ready for deployment, scaling, and connecting to a data source. Along with the data lake’s powerful processing capabilities, its software stack is the main aspect that differentiates a cloud data lake from large-scale storage repositories such as an enterprise data warehouse.
Alex Bordei, Head of Product Management at Bigstep, gave a presentation at Strata New York about building data lakes in the cloud. Aimed at attendees familiar with enterprise IT challenges and big data technologies, the session set out to deepen their understanding of the differences between building a data lake in the cloud and building one on premises. Alex talked about the reasons for and benefits of choosing the cloud, as well as security concerns, performance solutions, and software issues in the data lake environment.
With EDWs being outshone by cloud-based data lakes, Alex argued that on-premises infrastructure is no longer a viable option for data management or for the complex operations involved in analyzing huge data sets to support data-driven decisions and business intelligence. He also showed how data lakes can be safely and easily architected in the cloud and connected to on-premises environments.
Read the full presentation here.
About the speaker
Alex Bordei is Head of Product Management at Bigstep, where, after successfully launching two public clouds based on VMware software, he created the first prototype of the Bigstep platform. Alex has been developing infrastructure products for over nine years. Previously, he was one of the core developers for Hostway Corporation’s provisioning platform before focusing on defining and developing products for Hostway’s EMEA market, becoming one of the pioneers of virtualization in the company. Alex is engaged in mapping out ever more useful perspectives on the big data paradigm in order to encourage exploration and innovation through big data.