I am designing an ADF solution to read multiple source folders from the Data Lake raw layer. I need to build a config file through which I can dynamically manage which files/folders to load and the date range to load.
i.e., the folder structure in the container is as follows (root container, then source folder, then year/month/day folders):

crm/src1/2024/03/22/<files>

With a structure like this, I need to load the files from the raw layer into the staging layer dynamically, with a config file assigning the path and file types. We are receiving multiple types of files: CSV, JSON, etc.
Please share any solution; I am starting a new design.
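For example, the kind of config file I have in mind would be something like this (just a sketch; the folder names, date fields, and file types are placeholders):

```json
[
  {
    "sourceFolder": "crm/src1",
    "fileType": "csv",
    "loadFromDate": "2024-03-01",
    "loadToDate": "2024-03-31",
    "enabled": true
  },
  {
    "sourceFolder": "crm/src2",
    "fileType": "json",
    "loadFromDate": "2024-03-20",
    "loadToDate": "2024-03-22",
    "enabled": true
  }
]
```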
To achieve your requirement, first you need to generate a file containing the list of required source file paths.

To generate that file, use a dataflow. First, take a DelimitedText dataset, give it a path pointing at your root container, and set no column delimiter in it. Give this dataset as the source of the dataflow. You mentioned that your source files are in a directory structure like root(crm)/src1/2024/03/22/, so give the wildcard file path as */*/*/*/* to match that structure. In the source settings, add a column filepath (the "Column to store file name" option). Import the projection and check the data preview; you will get two columns, one of which contains the file paths of all types of source files.
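In data flow script form, the source step would look roughly like this (a sketch; source1 is a placeholder name, and I am assuming the file-name column option maps to rowUrlColumn, which is how the UI usually scripts it):

```
source(allowSchemaDrift: true,
    validateSchema: false,
    wildcardPaths:['*/*/*/*/*'],
    rowUrlColumn: 'filepath') ~> source1
```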
Next, use an aggregate transformation and group by the filepath column. In the aggregates section, create a new column count and take count(filepath) as a sample aggregate. This gives all distinct file paths and their counts. Then use a select transformation to remove the extra count column from the flow.
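In script form, these two steps would look something like this (again a sketch; the transformation names are placeholders, and the aggregate column is named fileCount here to avoid clashing with the count() function):

```
source1 aggregate(groupBy(filepath),
    fileCount = count(filepath)) ~> aggregate1

aggregate1 select(mapColumn(filepath),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> select1
```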
Now, take a derivedColumn transformation on the filepath column and use the expression dropLeft(filepath, 1) in it. This will remove the starting / in every row.
Next, take a filter transformation and filter down to the required file paths. You can use dataflow expressions and functions for your date-range condition in the filter transformation, as sketched below.
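Putting the derived column and the filter together in script form, and assuming the paths look like src1/2024/03/22/file.csv after the dropLeft, a date-range condition could extract the date portion with a regex, something along these lines (the date bounds are placeholders; you could also drive them from dataflow parameters):

```
select1 derive(filepath = dropLeft(filepath, 1)) ~> derivedColumn1

derivedColumn1 filter(between(
    toDate(regexExtract(filepath, '([0-9]{4}/[0-9]{2}/[0-9]{2})', 1), 'yyyy/MM/dd'),
    toDate('2024-03-01'),
    toDate('2024-03-31'))) ~> filter1
```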
Next, take another DelimitedText dataset as the sink of the dataflow and give it a file path pointing to a temporary container. Here, make sure the First row as header checkbox is cleared. Then go to the sink settings of the dataflow, set the File name option to output to a single file, and give the file name as filepaths.csv (single-file output requires setting the sink to a single partition).
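The sink step in script form would be roughly as follows (a sketch; I am assuming the single-file option maps to partitionFileNames with a single partition, which is how the UI usually scripts it):

```
filter1 sink(allowSchemaDrift: true,
    validateSchema: false,
    partitionFileNames:['filepaths.csv'],
    partitionBy('hash', 1)) ~> sink1
```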
Upon executing the dataflow from the pipeline, the filepaths.csv file will be generated with the required file paths.

After the dataflow activity, take a Copy activity in the pipeline with two Binary datasets as its source and sink. Give the path up to the source container (root crm) in the source Binary dataset and up to the target root container in the sink Binary dataset. In the Copy activity source, select the List of files option and give it the filepaths.csv file generated above.

Now, execute the pipeline, and all the paths listed in filepaths.csv will be copied to the target container, maintaining the same file structure.
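For reference, the relevant part of the Copy activity JSON would look along these lines (a sketch; the activity and dataset names are placeholders, the fileListPath should point at wherever your dataflow wrote filepaths.csv, and I am assuming ADLS Gen2 storage, hence the AzureBlobFSReadSettings type):

```json
{
  "name": "CopyFilteredFiles",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "BinarySource",
      "storeSettings": {
        "type": "AzureBlobFSReadSettings",
        "fileListPath": "temp/filepaths.csv"
      }
    },
    "sink": {
      "type": "BinarySink",
      "storeSettings": {
        "type": "AzureBlobFSWriteSettings"
      }
    }
  },
  "inputs": [ { "referenceName": "SourceBinaryDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "TargetBinaryDataset", "type": "DatasetReference" } ]
}
```

Because the paths in filepaths.csv are relative to the source dataset's container, the Copy activity preserves the same folder hierarchy on the sink side.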