Understanding what a data lake and a lakehouse are: implementation details


I'm trying to understand the concepts of a data lake and a lakehouse. Most links I've found so far only explain the differences from a macro/high-level perspective, for example this IBM page.

  • data warehouse -> relational database (SQL)
  • data lake -> both relational data and semi-structured/unstructured data
  • lakehouse -> best of both worlds

I'm having a hard time understanding how exactly one would implement a data lake and a lakehouse.

For a data lake, it doesn't seem enough, at least to me, to have a NoSQL database like MongoDB: if we want to store audio or video, converting it into MongoDB's JSON-like storage format seems very unnatural. So we would probably pair MongoDB for semi-structured data with a blob storage service such as S3 or MinIO.
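To make that idea concrete, here is a minimal sketch of "blob storage plus a metadata catalog". It is only an illustration under loud assumptions: a local temporary directory stands in for an S3 or MinIO bucket, a JSON-lines file stands in for the metadata store, and the helper names `put_object`, `register`, and `find` are hypothetical, not part of any real SDK.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def put_object(lake_root: Path, key: str, data: bytes) -> Path:
    """Store raw bytes under a key, roughly as a bucket would (assumption: local disk)."""
    path = lake_root / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path


def register(catalog: Path, key: str, **metadata) -> None:
    """Append one JSON line describing an object to a tiny catalog file."""
    with catalog.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"key": key, **metadata}) + "\n")


def find(catalog: Path, **filters) -> List[str]:
    """Return the keys whose metadata matches every filter."""
    keys = []
    with catalog.open(encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            if all(entry.get(k) == v for k, v in filters.items()):
                keys.append(entry["key"])
    return keys


# A temporary directory plays the role of the bucket.
lake = Path(tempfile.mkdtemp())
catalog = lake / "catalog.jsonl"

# Binary data is stored as-is; nothing is converted to JSON.
put_object(lake, "raw/audio/call-001.wav", b"RIFF....")
register(catalog, "raw/audio/call-001.wav", kind="audio", source="support")

# Semi-structured data lives next to it in the same store.
put_object(lake, "raw/events/2024-01-01.json", b'{"user": 1, "action": "login"}')
register(catalog, "raw/events/2024-01-01.json", kind="json", source="web")

audio_keys = find(catalog, kind="audio")
```

The point of the sketch is that the files themselves stay in their native format, and only a small description of each file needs to be queryable.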

Then, for a lakehouse, we would improve on the semi-structured data format by choosing something like Parquet, and then use a database that can somehow query Parquet files. Beyond that, I have no idea what else could be done.
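One thing that seems to go beyond "query engine over files" is a table layer: a log that records which immutable data files currently make up a table. Table formats such as Delta Lake, Apache Iceberg, and Apache Hudi are built around that idea. Below is a toy sketch of it, not the actual layout of any of those formats. Assumptions: JSON-lines files stand in for Parquet so that only the standard library is needed, and `commit` and `scan` are hypothetical names.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterator, List, Optional


def commit(table: Path, rows: List[Dict]) -> int:
    """Write rows to a new immutable data file, then record it in the log."""
    log_dir = table / "_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    version = len(list(log_dir.glob("*.json")))
    data_file = f"part-{version:05d}.jsonl"
    with (table / data_file).open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # Only the log entry makes the data file part of the table.
    (log_dir / f"{version:05d}.json").write_text(json.dumps({"add": data_file}))
    return version


def scan(table: Path, as_of: Optional[int] = None) -> Iterator[Dict]:
    """Yield rows from every file committed up to and including version `as_of`."""
    entries = sorted((table / "_log").glob("*.json"))
    if as_of is not None:
        entries = entries[: as_of + 1]
    for entry in entries:
        data_file = json.loads(entry.read_text())["add"]
        with (table / data_file).open(encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)


table = Path(tempfile.mkdtemp()) / "events"
commit(table, [{"user": 1, "amount": 10}, {"user": 2, "amount": 5}])
commit(table, [{"user": 1, "amount": 7}])

total_now = sum(r["amount"] for r in scan(table))          # every commit
total_v0 = sum(r["amount"] for r in scan(table, as_of=0))  # first commit only
```

Because readers go through the log rather than listing the directory, a half-written file is never visible, and older versions of the table can still be read. In a real setup the data files would be Parquet on object storage and the scan would be done by an engine such as Spark, Trino, or DuckDB.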

It's likely I'm completely missing the point of both concepts. That's why any help would be appreciated.

P.S.: Explanations at the level of a 5-year-old would be most welcome... :D
