5 Best Practices to Ensure Your Data Lake Is Swimmingly Successful
As big data becomes a mainstay of business, many organizations are abandoning the data warehouse for data lakes. With a data lake, you don’t have to work out up front how the data relates to each other or what it is good for. You just pour all the data in and let it swim around until you’re ready to use it. When you’re ready to start building and filling your data lake, here are some best practices to keep in mind.
1. Don’t Worry About What You’re Going to Do With All That Data
If you’ve spent your career working with SQL and relational databases, it’s going to be really hard for you to build a data store without nailing down exactly what the data will be used for and how. Take a deep breath and do it anyway. The beauty of a data lake is the ability to store all kinds of data that would normally go to waste, even if you don’t figure out a use for it for some time. Think of the data lake like your junk drawer at home: somewhere to stick all that miscellaneous junk until it’s time to pull it out and use it. One day, data that looked pretty worthless might yield tremendously valuable insights for BI, about your customers, or even on ways to get ahead of the competition.
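To make the "store now, decide later" idea concrete, here is a minimal sketch of applying a schema at read time rather than at write time. The sample records, the field names, and the `read_with_schema` helper are all made up for illustration; a real lake would read from files or object storage rather than a Python list.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no shared schema).
RAW_LINES = [
    '{"source": "web", "user": "a1", "page": "/pricing", "ms": 420}',
    '{"source": "pos", "store": 7, "sku": "X-100", "amount": 19.99}',
    '{"source": "web", "user": "b2", "page": "/home"}',
]


def read_with_schema(lines, source, fields):
    """Apply a schema at read time: keep one source, project chosen fields."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if record.get("source") != source:
            continue
        # Missing fields become None instead of failing the read.
        rows.append({name: record.get(name) for name in fields})
    return rows


if __name__ == "__main__":
    for row in read_with_schema(RAW_LINES, "web", ["user", "page", "ms"]):
        print(row)
```

Nothing had to be decided about the point-of-sale record when it was stored; it simply sits there until someone asks a question that needs it.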
2. Find a Good Data Scientist (Hint: They Don’t Exist)
Data scientists are like elves and leprechauns. You hear a lot about them, but darn if you ever meet one in person. The good news is that a data scientist can be developed instead of hired. Many companies find that a team is more successful than a single data scientist, anyway. Look for strong mathematical skills, particularly in statistical analysis. Combine that with savvy programming talent and a good grasp of the business side of things. Sprinkle with salt and pepper, allow to marinate for a few months (preferably while studying Hadoop and data analytics) and Bam! A data scientist will emerge to help you develop and manage your data lake.
3. Decide on a Platform for Your Data Lake (Hint: It’s Hadoop)
Hadoop isn’t fast, it isn’t easy, and it isn’t necessarily cheap. But it is highly effective for managing and analyzing enormous sets of unstructured data, which is exactly what you’ll be dealing with in your new data lake. You can search and search, but you won’t find a better option for data crunching and munching than Hadoop on bare metal. Just remember, it will take time to get a handle on, especially if you’re home-baking a data scientist or two.
4. Find New Sources to Feed Your Data Lake
One of the most powerful advantages of a data lake is the ability to stream in lots of different data from disparate sources without having to worry about what it will be used for until much later. Stream away! Begin offloading data from various systems and your data scientists will soon find uses for most, if not all, of it.
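One way to keep data from disparate sources usable later is to wrap each raw payload in a small envelope that records where it came from and when it arrived, while leaving the payload itself untouched. The sketch below assumes that approach; the `to_envelope` and `ingest` helpers, the envelope field names, and the example sources are hypothetical, not part of any particular ingestion tool.

```python
import json
from datetime import datetime, timezone


def to_envelope(source, payload, ingested_at=None):
    """Wrap a raw payload with minimal metadata, leaving the payload untouched."""
    if ingested_at is None:
        ingested_at = datetime.now(timezone.utc)
    return json.dumps({
        "source": source,
        "ingested_at": ingested_at.isoformat(),
        "payload": payload,
    })


def ingest(batches):
    """Flatten (source, payloads) pairs into one list of JSON lines."""
    lines = []
    for source, payloads in batches:
        for payload in payloads:
            lines.append(to_envelope(source, payload))
    return lines


if __name__ == "__main__":
    # Structured CRM rows and unstructured web log lines land side by side.
    batches = [
        ("crm", [{"id": 1, "name": "Acme"}]),
        ("weblog", ["GET /home 200", "GET /missing 404"]),
    ]
    for line in ingest(batches):
        print(line)
```

Because the payload is stored as-is, nobody has to agree on a common structure before the data starts flowing.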
5. Stay on Top of Capacity Planning
The thing about data lakes is that, since you’re essentially pouring in everything but the kitchen sink (and perhaps even that), they tend to grow much larger, and much faster, than a typical data warehouse. Hence, it’s important to stay on top of capacity planning. Partner with a solid DBaaS provider that can offer the scalability you need to maintain and manage your data lake.
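For a rough feel of how quickly a lake can outgrow its storage, here is a back-of-the-envelope sketch that assumes steady compound monthly growth. The function names and all of the numbers are illustrative assumptions; real growth is rarely this smooth, so treat the output as a starting point for a conversation with your provider rather than a forecast.

```python
def project_storage(current_tb, monthly_growth_rate, months):
    """Projected size in TB for each month, assuming compound growth."""
    return [current_tb * (1 + monthly_growth_rate) ** m for m in range(months + 1)]


def months_until_full(current_tb, monthly_growth_rate, capacity_tb):
    """Whole months until the lake reaches capacity, or None if it never does."""
    if current_tb >= capacity_tb:
        return 0
    if monthly_growth_rate <= 0:
        return None
    months = 0
    size = current_tb
    while size < capacity_tb:
        size *= 1 + monthly_growth_rate
        months += 1
    return months


if __name__ == "__main__":
    # Illustrative numbers only: 100 TB today, growing 10% a month, 200 TB provisioned.
    print(months_until_full(100, 0.10, 200))
    print([round(tb, 1) for tb in project_storage(100, 0.10, 6)])
```

Even a modest monthly growth rate doubles the footprint in well under a year, which is why this check is worth rerunning regularly with your own numbers.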
A data lake will serve as your repository for the data you need heading into the future.