Technically Speaking

The Official Bigstep Blog

Understand the Differences Between Data Warehouses and Data Lakes

To understand what data warehouses and data lakes are, let’s start with language: why is a data lake called a data lake, and a data warehouse a data warehouse? We then look at the main characteristics of each, illustrate them with use cases, and ultimately help you identify which is better suited to your business. If you already know what data warehouses and data lakes are, you can go straight to Points 2 and 3 for the use cases and characteristics.

  1. Understanding Data Warehouses and Data Lakes

What is a warehouse? A warehouse is a place where products are stored before they are used, distributed, or placed on sale. Usually, the goods are arranged in an orderly fashion, by type of product. The warehouse is organized so that you can find products quickly when you need them and know where to place new goods the moment they arrive.

In the future, this work will probably be done by AI robots, so having an organized warehouse is essential to having an orderly business as you don’t want to get the AI confused (if, for instance, you mix cereals with paper towels).

Similarly, a data warehouse in computing is a system based on the idea of traditional warehouses: structured data, organized by type. For instance, if you need to gather data from a specific document, the system will take each piece of information and place it exactly in its right spot.

Let's take drivers’ licenses as an example: you have some drivers’ licenses that you need to upload data from. The way in which a data warehouse works is that it will take information such as first name, last name, date of birth, age, address, and so on and upload it in an orderly fashion, in the already created fields for that particular type of information.

These fields do not accept other types of data. For instance, the first name and last name must be ‘char,’ the date of birth must be a ‘date,’ and the age must be ‘int.’ Any information that does not have a place will not be uploaded, in the same way that, in a traditional warehouse, products you receive for the first time are set aside until a spot is assigned to them.

This means that a data warehouse has a "schema-on-write" approach: the data must be refined, transformed, and structured to fit the already defined data structure perfectly before it is stored. A typical data warehouse system is ETL-based, meaning “extract, transform, and load.”
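As a rough illustration of schema-on-write, the sketch below runs a tiny extract, transform, and load cycle on the driver's-license example. The schema, the field names, and the `transform` and `load` helpers are all hypothetical, and storing the date of birth as a `date` is a choice made for this sketch only.

```python
from datetime import date

# Hypothetical schema for the driver's-license example: field name -> required type.
LICENSE_SCHEMA = {
    "first_name": str,
    "last_name": str,
    "date_of_birth": date,
    "age": int,
}


def transform(raw: dict) -> dict:
    """Transform step: coerce raw strings into the types the schema expects."""
    return {
        "first_name": raw["first_name"].strip(),
        "last_name": raw["last_name"].strip(),
        "date_of_birth": date.fromisoformat(raw["date_of_birth"]),
        "age": int(raw["age"]),
    }


def load(table: list, record: dict) -> None:
    """Load step: refuse any record that does not match the schema exactly."""
    if set(record) != set(LICENSE_SCHEMA):
        raise ValueError("record fields do not match the schema")
    for field, expected in LICENSE_SCHEMA.items():
        if not isinstance(record[field], expected):
            raise TypeError(f"{field} must be {expected.__name__}")
    table.append(record)


warehouse_table = []
extracted = {"first_name": " Ana ", "last_name": "Pop",
             "date_of_birth": "1990-05-17", "age": "34"}
load(warehouse_table, transform(extracted))

rejected = False
try:
    # An extra 'photo' field has no place in the schema, so the record is refused.
    load(warehouse_table, {**transform(extracted), "photo": b"..."})
except ValueError:
    rejected = True
```

The point of the design is that validation happens before storage, so everything in `warehouse_table` is guaranteed to have the same shape.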

What about a lake? A lake is a large pool of water, where you can find all sorts of plants and animals, including fish, turtles, and algae. Although each has its own characteristics and a specific role in the ecosystem, it is far less restricted in terms of positioning.

Similarly, a data lake holds structured, semi-structured, and unstructured data all at once. For instance, if you need to work with drivers’ licenses, photographs, and tweets at the same time, all of them can be stored in, and read directly from, the data lake. Since data in the data lake is raw, you have access to all of the information. Moreover, the data lake is well suited to large tasks or data sets.

This means that a data lake has a "schema-on-read" approach: a schema is applied to the data as it is read from storage, rather than when it goes in. With schema-on-read, you load all of the data exactly as it is, and only apply structure to it when you take it back out. You can extract the information you need from all of the data you have, irrespective of its type or source.
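Schema-on-read can be sketched in a few lines as well. In the example below, a plain Python list stands in for the lake and holds raw JSON strings of different kinds; the `read_with_schema` helper and every field name are assumptions made for illustration, not part of any particular data lake product.

```python
import json

# The "lake": raw items stored exactly as they arrived, with no schema enforced.
data_lake = [
    json.dumps({"kind": "license", "first_name": "Ana", "last_name": "Pop", "age": 34}),
    json.dumps({"kind": "tweet", "user": "ana_pop", "text": "Passed my driving test!"}),
    json.dumps({"kind": "photo", "file": "ana.jpg", "width": 800, "height": 600}),
]


def read_with_schema(lake, kind, fields):
    """Apply a schema while reading: keep one kind of item and only the requested fields."""
    for raw in lake:
        item = json.loads(raw)
        if item.get("kind") == kind:
            # Missing fields become None instead of blocking the read.
            yield {name: item.get(name) for name in fields}


licenses = list(read_with_schema(data_lake, "license", ["first_name", "age"]))
tweets = list(read_with_schema(data_lake, "tweet", ["user", "text", "language"]))
```

Nothing was rejected at write time, and each reader decides which structure it needs. The trade-off is that problems in the data only surface when someone reads it.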

  2. Data Warehouse Use Case and Characteristics

Let’s take SmartHR, a fictional HR company from the EU. It uses custom software to enter data about potential candidates, with a SQL database as the backend. All the data SmartHR collects comes from structured repositories or from direct human input, through its website or software, into the database.

The database stores data about candidates, entered by the candidates themselves on the SmartHR website or by SmartHR employees. The company also stores client data and client invoices in the same database. In this use case, data comes from many different source systems, such as the website, billing, and CRM, and SmartHR uses the data warehouse to cross-reference this information and get a clearer understanding of its users and candidates.
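To make the cross-referencing concrete, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The three tables, their columns, and the sample rows are invented for this example, since the article does not describe SmartHR's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crm_clients (client_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE billing_invoices (
        invoice_id INTEGER PRIMARY KEY,
        client_id INTEGER NOT NULL REFERENCES crm_clients(client_id),
        amount_eur REAL NOT NULL
    );
    CREATE TABLE website_candidates (
        candidate_id INTEGER PRIMARY KEY,
        client_id INTEGER NOT NULL REFERENCES crm_clients(client_id),
        full_name TEXT NOT NULL
    );
""")
conn.executemany("INSERT INTO crm_clients VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO billing_invoices VALUES (?, ?, ?)",
                 [(10, 1, 500.0), (11, 1, 250.0), (12, 2, 300.0)])
conn.executemany("INSERT INTO website_candidates VALUES (?, ?, ?)",
                 [(100, 1, "Ana Pop"), (101, 1, "Dan Ilie"), (102, 2, "Eva Marin")])

# Correlated subqueries aggregate each source per client, which avoids the
# double counting a plain three-way join would cause.
rows = conn.execute("""
    SELECT c.name,
           (SELECT COUNT(*) FROM website_candidates w
             WHERE w.client_id = c.client_id) AS candidates,
           (SELECT SUM(b.amount_eur) FROM billing_invoices b
             WHERE b.client_id = c.client_id) AS billed_eur
    FROM crm_clients c
    ORDER BY c.name
""").fetchall()
conn.close()
```

Because all three sources share one structured store and one client key, a single query can answer a question that spans the website, billing, and CRM data.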

  • Efficient On Low Data Volumes

SmartHR does not collect a lot of data. Its whole candidate and client database is a mere 50 GB. On large volumes of data, data warehouses become costly and inefficient, but that is not the case for SmartHR.

  • Works Without Data Scientists

SmartHR is a small company with a CEO and a CMO. With a data warehouse, SmartHR does not need a data scientist, as both executives can handle the structured data themselves.

  • Security

SmartHR stores some sensitive data in their data warehouse. Security in the data warehouse is more mature as compared to security in a data lake, so SmartHR may choose to prioritize security.

There are other ways of storing data, like the blockchain, which is extremely secure, but costs are of course much higher.

  3. Data Lake Use Case and Characteristics

Let’s take SmartShip, another company from the EU that I just made up. SmartShip makes its living by commissioning international shipments. It has a small team and a small-to-medium annual turnover, and it uses a data warehouse to store shipment details, invoices, and so on.

  • Scalability & Cost Efficiency

Data lakes are highly scalable: they grow with your organization in terms of both infrastructure and the resources needed, and they remain cost-efficient as they do.

SmartShip gets more customers and increases its annual turnover, but the current data warehouse cannot handle more data. Storing data becomes a nightmare, and a lot of time and workforce is wasted. The staff cannot keep up with the volume of data and start to mix up orders, and SmartShip loses customers.

  • Independence & Malleability

Data warehouses may limit your freedom and adaptability, as they follow a single, rigid data model.

Then the GDPR strikes SmartShip. The way the company has stored its data for years is not GDPR-compliant, so the whole database system has to be reworked to protect private data. With a data lake, SmartShip could have made a few tweaks to conform to the new regulation.

Because of its data warehouse and the way it was designed years ago, SmartShip incurs substantial legal and IT costs.

  • Ability to Handle Streaming

Data lakes can handle streaming and input in real-time from IoT devices.

SmartShip has just installed tracking devices on all cargo. The data warehouse cannot receive and store real-time data, so SmartShip needs a new, custom data store to handle the tracking. This costs a lot of money, and because the data now lives in two different places, SmartShip cannot cross-reference the cargo tracking data with the other data it collects.
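For contrast, the sketch below shows what keeping tracking events and shipment records in one place could look like. It is a toy model only: a Python list plays the role of the lake, and the message format, the shipment ids, and the `ingest` helper are all hypothetical.

```python
import json

# Existing business data, keyed by a hypothetical shipment id.
shipments = {
    "S-1": {"customer": "Acme", "destination": "Rotterdam"},
    "S-2": {"customer": "Globex", "destination": "Hamburg"},
}

# Raw tracker messages as they might arrive, one JSON document per reading.
incoming = [
    '{"shipment_id": "S-1", "ts": 1, "lat": 44.4, "lon": 26.1}',
    '{"shipment_id": "S-2", "ts": 2, "lat": 45.8, "lon": 24.9}',
    '{"shipment_id": "S-1", "ts": 3, "lat": 47.5, "lon": 19.0}',
]

lake_events = []  # append-only store; events are kept raw


def ingest(message: str) -> None:
    """Append one event as soon as it arrives, without reshaping it."""
    lake_events.append(message)


for message in incoming:
    ingest(message)

# Cross-reference at read time: find the latest position per shipment...
latest = {}
for raw in lake_events:
    event = json.loads(raw)
    current = latest.get(event["shipment_id"])
    if current is None or event["ts"] > current["ts"]:
        latest[event["shipment_id"]] = event

# ...and combine it with the shipment details stored alongside.
report = {
    sid: {**shipments[sid], "lat": ev["lat"], "lon": ev["lon"]}
    for sid, ev in latest.items()
}
```

Ingestion stays cheap because events are appended untouched, and the cross-reference with shipment details happens only when a report is needed.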

Data lakes enable you to:

  • Run big data analytics:

You may get by without big data for now, but in time you will not be able to avoid adopting it.

  • Run machine learning on top:

Once you get a feel for the uses of big data analytics, you will want to benefit from the many applications of machine learning as well.

Even though data lakes are most useful in a use case like SmartShip’s, your organization may not need one. For example, SmartHR, at its current stage of development, is better suited to a data warehouse.

However, an HR company could scrape data from websites such as LinkedIn to find better candidate suggestions, so a use for a data lake may still arise.

 

  4. Other Differences Between Data Warehouses and Data Lakes

| Data Warehouses | Data Lakes |
| --- | --- |
| Queries run on a single server. If the query is too big, it takes longer, as the data is stored on a single server. | Queries run on more than one server, and each of them has its own storage system (Hadoop). In the end, all of the results are collected and added up. |
| Fast only when there is not a lot of data. | Faster response time per query. |
| Expensive per GB. | More affordable per GB. |
| | Open-source. |
| Strict. | Flexible. |
| | Machine learning. |
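The comparison above says that a data lake query runs on more than one server and that the results are added up at the end. The sketch below imitates that pattern on a single machine: each list is a stand-in for the rows held by one server, and threads play the role of the servers. It does not use Hadoop or any real cluster API, and all names and numbers are made up.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: each list stands in for the rows stored on one server.
partitions = [
    [{"route": "EU", "weight_kg": 120}, {"route": "US", "weight_kg": 80}],
    [{"route": "EU", "weight_kg": 200}],
    [{"route": "US", "weight_kg": 50}, {"route": "EU", "weight_kg": 30}],
]


def partial_total(rows, route):
    """Runs against one partition: sum the weights for the requested route."""
    return sum(row["weight_kg"] for row in rows if row["route"] == route)


def distributed_total(route):
    """Send the same query to every partition in parallel, then add up the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(lambda rows: partial_total(rows, route), partitions))
    return sum(partials)


eu_total = distributed_total("EU")
```

Each partition only scans its own rows, so adding servers spreads the work. A single-server warehouse would have to scan every row in one place.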

 

  5. Conclusion

To conclude, here are two crucial pieces of advice, so that you do not spend resources on moving your data if your organization does not yet require it. Be prepared, identify your business challenges first, get a team and professional support, and only then proceed to make changes.

  • Don’t go in unprepared

Identify your business challenges correctly, as every technology update needs to address a business issue.

  • Get executive buy-in first

Do not start moving your data to a data lake without proper operational support: it may turn rancid.

For any questions, please ask us for support.
