Optimal approach for accessing a data lake container with multiple teams?

We have a data lake folder "data/landing" containing several subfolders. Each subfolder represents a table that is extracted from an on-prem database on a daily basis, with the data stored in Parquet files.

Ex: data/landing/customer/customer_20240319.parquet, data/landing/customer/customer_20240318.parquet,
data/landing/employee/employee_20240319.parquet, data/landing/employee/employee_20240318.parquet

We archive data into the "data/archive" folder: we keep only 3 days of data in the "data/landing" folder for each table and move the rest to the archive folder once it has finished processing.

Ex: data/archive/customer/customer_20240316.parquet, data/archive/customer/customer_20240315.parquet, 
data/archive/employee/employee_20240316.parquet, data/archive/employee/employee_20240315.parquet

We then create an external table in the dedicated SQL pool for each folder under "data/landing" to read and process the data and load the data warehouse.
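For reference, this is a minimal sketch of how those external tables are set up; the storage account, credential, schema, and column names below are placeholders, not our real objects:

```sql
-- Assumes a database scoped credential (adls_credential) already exists.
CREATE EXTERNAL DATA SOURCE lake_data
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://data@<storageaccount>.dfs.core.windows.net',
    CREDENTIAL = adls_credential
);

CREATE EXTERNAL FILE FORMAT parquet_format
WITH (FORMAT_TYPE = PARQUET);

-- One external table per landing subfolder, e.g. the customer table.
CREATE EXTERNAL TABLE landing.customer
(
    customer_id   INT,
    customer_name NVARCHAR(200)
    -- remaining columns ...
)
WITH (
    LOCATION = '/landing/customer/',
    DATA_SOURCE = lake_data,
    FILE_FORMAT = parquet_format
);
```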

Basically, at this point the data we extracted from the on-prem SQL Server for each table resides in two folders, "data/landing" and "data/archive". Daily, we only access the "data/landing" folder through external tables to load the deltas into the data warehouse.

Now, a marketing team in the organization would like to access the data in these two folders ("data/landing" and "data/archive"), but they need the entire dataset in one folder so that they also have access to the historical data to build their own analytical solution.

What would be the optimal approach for this scenario?

I thought about three approaches:

Approach #1: Create a separate landing folder (data/marketing) in the data lake for the marketing team and a separate schema (e.g. stg_abc) in the data warehouse for the external tables to live in. Create a pipeline that uses a Copy activity to copy all the files for each table from the "data/landing" and "data/archive" folders into data/marketing. Then create external tables in the stg_abc schema on top of the data/marketing subfolders for the marketing team to access the data. The downside to this approach is the performance impact on queries when a join is performed across several external tables in the dedicated SQL pool.
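The stg_abc external tables would simply point at the copied marketing folder, something like this (reusing the data source and file format from the earlier sketch; names and columns are placeholders):

```sql
-- External table for the marketing team over the consolidated copy of customer data.
CREATE EXTERNAL TABLE stg_abc.customer
(
    customer_id   INT,
    customer_name NVARCHAR(200)
    -- remaining columns ...
)
WITH (
    LOCATION = '/marketing/customer/',   -- full history copied here by the pipeline
    DATA_SOURCE = lake_data,
    FILE_FORMAT = parquet_format
);
```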

Approach #2: Spin up a new Azure SQL Database resource, follow the steps in Approach #1 to get the data into the data/marketing folder, and create a staging area that materializes the data so it lives in a physical table rather than an external table, paving the way for better query performance.
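To illustrate what I mean by materializing: the staging area in the new Azure SQL database would just be plain physical tables that the Copy activity loads, roughly like this (schema, table, and column names are hypothetical, and the stg_abc schema is assumed to exist):

```sql
-- Physical staging table in Azure SQL Database; the pipeline's Copy activity
-- would load the combined landing + archive Parquet data into it as rows.
CREATE TABLE stg_abc.customer
(
    customer_id   INT           NOT NULL,
    customer_name NVARCHAR(200) NULL,
    load_date     DATE          NULL
    -- remaining columns ...
);

-- A clustered index on the usual join/filter column so downstream joins
-- hit an index instead of scanning external files.
CREATE CLUSTERED INDEX cix_stg_customer ON stg_abc.customer (customer_id);
```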

Approach #3: Follow the steps in Approach #1, but instead of creating external tables in the dedicated SQL pool, read the files in the serverless SQL pool through OPENROWSET for faster reads. However, I am not sure how to join across multiple files through the serverless SQL pool, or how effective the performance would be.
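My understanding is that a wildcard path in OPENROWSET reads every daily file for a table as one rowset, and two rowsets can then be joined like ordinary tables, along these lines (storage account, paths, and the join column are placeholders, and I have no feel for how this performs at scale):

```sql
-- Serverless SQL pool sketch: each OPENROWSET reads all Parquet files for one
-- table via a wildcard, then the two rowsets are joined like regular tables.
SELECT c.customer_id,
       c.customer_name,
       e.employee_name
FROM OPENROWSET(
         BULK 'https://<storageaccount>.dfs.core.windows.net/data/marketing/customer/*.parquet',
         FORMAT = 'PARQUET'
     ) AS c
JOIN OPENROWSET(
         BULK 'https://<storageaccount>.dfs.core.windows.net/data/marketing/employee/*.parquet',
         FORMAT = 'PARQUET'
     ) AS e
  ON e.customer_id = c.customer_id;   -- placeholder join key
```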

I am not sure whether the above approaches are optimal; I would really appreciate any suggestions or recommendations on how to get to the optimal approach.

Thanks!
