Technically Speaking

The Official Bigstep Blog

 

4 Big Goofs to Avoid When Creating Your Data Lake

If you're in the position of managing organizational data, you've probably heard about the concept of data lakes. While data lakes are marked by their size, the primary difference between data lakes and the good ol' data warehouse is that the data lake stores data in its native format. This means that you don't have to determine a use for the data until it's needed. You can store it now and worry about use cases later. Well, sort of. Data lakes become powerful tools as organizations make headway in finding uses for big data and tools to work with it. But you can't just build a data lake and hope people find uses for it. Here are the biggest mistakes to avoid when constructing your data lake.

1. Failing to Build the Data Lake Around a Comprehensive Data Strategy

The data lake can be part of your data strategy, but simply building a data lake is not a big data strategy.

It’s true that the data lake will hold on to data until you find a use, but you’re setting the entire project up for failure if you think that your data lake is your data strategy. Many businesses go to the time, trouble, and expense of developing a data lake, but fail to build a comprehensive data strategy that defines its uses and purposes within the organization. It then becomes like the pond behind Grandpa’s old house: nobody uses it. Develop a company-wide data strategy, and then build a data lake that meets the needs and purposes of your big data plans. Establish policies that encourage (or perhaps mandate) developers to use the data lake when creating new applications.

2. Neglecting to Tag the Data with Sufficient Metadata

Without rich, complete metadata, the data lake quickly becomes the data cesspool. Metadata defines what the data is, where it came from, and what quality it is. You also need to build in a means to track how the data in the data lake is used, how it was accessed, and other historical markers. Together, descriptive metadata and usage tracking are what make the data in the data lake discoverable, searchable, and trackable.
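
To make that a little more concrete, here is a minimal sketch in Python of what a metadata record and a tiny catalog might look like. Everything in it is hypothetical and for illustration only: the names DatasetMetadata, AccessEvent, and Catalog, the field names, the quality labels, and the sample path are assumptions, not part of any particular product. A real data lake would keep this in a dedicated metadata catalog rather than in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AccessEvent:
    """One historical marker: who touched the dataset, how, and when."""
    user: str
    action: str  # hypothetical labels, e.g. "read" or "export"
    at: datetime


@dataclass
class DatasetMetadata:
    """Descriptive metadata kept alongside a raw object in the lake."""
    path: str            # where the object lives in the lake
    description: str     # what the data is
    source_system: str   # where it came from
    quality: str         # hypothetical labels, e.g. "raw" or "validated"
    tags: set[str] = field(default_factory=set)
    access_log: list[AccessEvent] = field(default_factory=list)


class Catalog:
    """In-memory stand-in for a real metadata catalog (illustration only)."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetMetadata] = {}

    def register(self, meta: DatasetMetadata) -> None:
        """Make a dataset discoverable by adding its metadata."""
        self._entries[meta.path] = meta

    def search(self, tag: str) -> list[DatasetMetadata]:
        """Return every dataset carrying the given tag, sorted by path."""
        hits = (m for m in self._entries.values() if tag in m.tags)
        return sorted(hits, key=lambda m: m.path)

    def record_access(self, path: str, user: str, action: str) -> None:
        """Append an access event so usage stays trackable."""
        event = AccessEvent(user, action, datetime.now(timezone.utc))
        self._entries[path].access_log.append(event)


# Example usage with made-up values.
catalog = Catalog()
catalog.register(
    DatasetMetadata(
        path="raw/crm/customers.json",
        description="Customer records exported from the CRM",
        source_system="crm",
        quality="raw",
        tags={"customers", "pii"},
    )
)
catalog.record_access("raw/crm/customers.json", user="analyst1", action="read")
print([m.path for m in catalog.search("pii")])
```

The point of the sketch is the shape of the record, not the storage: each object carries a description, a source, and a quality label, and every access leaves a trace that can be searched later.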

3. Confining the Data Lake to Specific Tools and Products

There are a ton of big data platforms, tools, and products out there, each one with its own list of pros and cons. Some tools are powerfully promising, but too new to offer real reliability and support. Others have been around a while but have certain disadvantages, such as the inability to handle real-time data streaming or being really difficult to write code for. Build a data lake that can be accessed and processed using a wide variety of big data tools, like Spark, Storm, Hive, MapReduce, Tez, Flink, etc.

The usefulness of a data lake is in the fact that multiple teams can leverage it for many different purposes. Don’t lock them in by building a data lake that is only compatible with MapReduce or any other specific tool.

4. Setting Up Lots of Data Ponds Instead of a Big, Inclusive Data Lake

The data lake is supposed to be the end of restrictive, prohibitive data silos, but if you aren’t careful, your organization will end up building lots of little data ponds, each too small to be a data lake and just as isolated as the data silos you’re trying to replace. This mistake will also rack up serious charges for your cloud services without providing any real value to the organization. Make sure that a data lake initiative includes all of the organization’s data from all of its databases and data warehouses, applications, systems, etc. A data lake should be all or nothing. That isn’t to say that you can’t build a lake and gradually add the systems and sources that feed data into it, but it does mean that you start a single lake and don’t let any others spring up elsewhere.

Are you ready to get started on your data lake? Take advantage of the world’s first and only Full Metal Data Lake as a Service at Bigstep. Learn more about us and DLaaS today.
