Technically Speaking

The Official Bigstep Blog

 

What Goes Into Designing a Successful Data Lake?

Data lakes are a relatively new concept, and they are a solution to data silos, where data gets locked away and becomes inaccessible to other systems, departments, and users. More technically, a data lake is an architecture centered on the collection, storage, and management of data, designed to house huge quantities of data stored in its native format (or darn close to it).

The beauty of a data lake is that you can set one up without knowing exactly how the data will be used, because the data doesn’t have to be formatted yet. For example, you can stick in all the log data from your Web servers, along with your social media data, data from proprietary software, and all the stuff out of your other databases. It’s like the patchwork quilt of data storage.

Data lakes are also useful because they can manage the 4 V’s of Big Data: Volume, Variety, Velocity, and Veracity. Especially when you elect to build your data lake in the cloud (DLaaS), data lakes are as scalable as you need. Data can be fed into the lake via batch processing or real-time streaming. Additionally, data can be pulled out of the data lake by anyone in the organization and used for any number of purposes. If, for example, multiple departments in your organization are pursuing their own, separate data agendas (marketing wants customer data, finance wants sales and production data, the executives want real-time BI, etc.), a data lake allows you to service all of them without compromising the data or hamstringing them with application-specific formatting. For once, IT can make everyone happy.

So, what does it take to build a smart, sensible, valuable data lake?

Don’t Limit Yourself in Terms of Tools and Solutions

Data lakes are repositories of data stored in their original format. While a good design assures your data lake is as placid as a picture, the power and potential of big data are always there, beneath the surface.

It’s really better not to nail yourself down with a single solution like Hadoop YARN, Spark, Storm, etc. Since both the data and the uses for the data are by nature varied (that’s kind of the whole point of a data lake), you don’t want to box anyone in with a specific platform. For instance, you might use MapReduce for some of the deep analytical tasks that aren’t too time sensitive, while springing for something new and edgy like Flink for the real-time streaming stuff that won’t tolerate latency. Build a tool-agnostic data lake for maximum usefulness.

Manage Metadata Automatically and Smartly

How that data gets tagged before it’s stuffed into the data lake will determine how (if?) it’s ever retrieved again and put to use. Without a good, automated system for adding metadata to your data, the data lake will quickly become a data graveyard. Tag it with information on where it came from, its quality, and its use history.
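To make the tagging idea concrete, here is a minimal Python sketch of an automated metadata step run at ingestion time. Everything in it is an assumption for illustration: the `CatalogEntry` fields, the `tag_object` and `record_use` helpers, and the crude "empty" versus "unchecked" quality label are hypothetical, not part of any particular data lake product.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """Metadata kept alongside one object in the lake (hypothetical schema)."""
    path: str
    source: str          # where the data came from
    checksum: str        # lets later users verify the object is unchanged
    size_bytes: int
    quality: str         # placeholder quality label, see tag_object
    ingested_at: str
    use_history: list = field(default_factory=list)


def tag_object(path, payload, source):
    """Build the catalog entry automatically from the raw bytes."""
    # Assumption: a real pipeline would run proper quality checks here.
    quality = "empty" if len(payload) == 0 else "unchecked"
    return CatalogEntry(
        path=path,
        source=source,
        checksum=hashlib.sha256(payload).hexdigest(),
        size_bytes=len(payload),
        quality=quality,
        ingested_at=datetime.now(timezone.utc).isoformat(),
    )


def record_use(entry, user, purpose):
    """Append one line of use history each time the object is read."""
    entry.use_history.append({
        "user": user,
        "purpose": purpose,
        "at": datetime.now(timezone.utc).isoformat(),
    })


entry = tag_object(
    "raw/weblogs/2016-01-01.log",
    b"GET /index.html 200\n",
    source="web-server-logs",
)
record_use(entry, user="marketing", purpose="traffic report")
```

The point of the sketch is that none of the tags are typed in by hand: origin, a basic quality flag, and use history are all captured as a side effect of writing to and reading from the lake.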

Design Your Data Lake to Accept New Types of Data Easily

When you build your data lake, you can’t necessarily foresee what you’ll want to add to it in the future. Perhaps you set it up to accept lots of social media posts and text documents, but over time you want to add data from a bunch of open source and proprietary applications. Again, it’s about not hamstringing yourself for future uses you find for the data lake. Set it up from the start to be flexible enough to handle a wide variety of data from various sources in wildly different formats.
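One way to keep that flexibility, sketched below in Python under simplifying assumptions, is to store every object as raw bytes and only parse it when someone reads it. The in-memory `lake` dictionary stands in for real object storage, and the `reader` registry and file names are hypothetical. Supporting a new format then means registering one more reader, with no change to how data is stored.

```python
import csv
import io
import json

READERS = {}  # file extension -> parsing function


def reader(extension):
    """Decorator that registers a parser for one file extension."""
    def register(func):
        READERS[extension] = func
        return func
    return register


@reader(".json")
def read_json(raw):
    return json.loads(raw.decode("utf-8"))


@reader(".csv")
def read_csv(raw):
    return list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))


def read_object(name, raw):
    """Parse on read; formats without a reader come back as raw bytes."""
    for extension, func in READERS.items():
        if name.endswith(extension):
            return func(raw)
    return raw


# In-memory stand-in for the lake: everything is stored in its native format.
lake = {
    "social/post-1.json": b'{"user": "a", "text": "hi"}',
    "sales/q1.csv": b"region,total\nEU,10\n",
    "app/dump.bin": b"\x00\x01",  # no reader yet, still accepted
}

parsed = {name: read_object(name, raw) for name, raw in lake.items()}
```

Note that the binary dump is stored even though nothing can parse it yet; a reader for it can be added whenever someone finds a use for that data.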

Set Up the Data Lake to Integrate Nicely with the Rest of Your IT Environment

With a data lake, your teams can develop ideas years from now and still have access to all the data they need, stored in its native format. This means that analysts aren’t boxed in; they can use the data however they need to.

Islands aren’t cool in data lakes. That is, the data lake can’t be an island unto itself. It needs to be integrated well into the overall IT infrastructure, or no one will ever bother pulling data out of it to use, and you’ve wasted all your time and effort building a data wasteland.

By taking time to design the data lake for flexibility from the start, your company can take its time finding powerful uses for the data you are able to collect and store. While most companies are still reeling about, wondering what to do with all this Big Data nonsense, your teams can be quietly squirreling away data, preparing for all the great uses that have yet to identify themselves.

Bigstep is offering the world’s first Full Metal Data Lake as a Service. See our products and learn more about setting up your data lake today.
