We already have an Azure Data Factory (ADF) in place for other processes we run, so we are trying to use it to solve the following:
What we already have:
- ADF deployed and running, we'd just create a new pipeline/data flow;
- A Linked Service to access the required folders (which are on-premises).
What we don't have at the moment but will likely be needed for the end solution:
- Any type of storage in our resource group, be it a DB, Blob Storage, etc. We can, however, use the file system we have access to via the Linked Service, as volume is not high and we will not use the data outside the pipeline itself.
Use case:
- There are XML files placed in a folder "F1" that are generated and named by a system that connects to a FIX service of our local Stock Exchange. We cannot change anything about this process;
- This is a very basic process, and since the files are named by this system without any regard to their contents, a trade that was already downloaded before can get downloaded again under a different file name, even though the contents of the XML file are the same;
- So we need a way to take such files and record some information about them somewhere we can look up later, to decide whether to send this "new file" downstream or ignore it as a duplicate. The current "ETL", which is to be decommissioned by the solution discussed here, hashes the contents of each file, but that is overkill: there is at least one tag within each XML that would uniquely identify the trade (see the sketch after this list);
- Ultimately, we also need to zip the collection of non-duplicate XML files and place that zip in a folder "F2", to be sent downstream to another system;
- Each day we start fresh, with no need to look up whether a trade seen today was already reported yesterday, so we can wipe any storage solution clean at the end or beginning of each day.
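To make the duplicate check concrete, here is a minimal Python sketch of the logic described above, independent of where it would end up running (an ADF-driven step, a Function, or anything else). The folder paths, the daily zip name, and the identifying tag name `TradeID` are assumptions for illustration, and XML namespaces are ignored; the real values would come from your FIX XML schema and the Linked Service.

```python
import glob
import os
import xml.etree.ElementTree as ET
import zipfile
from datetime import date

# Assumed locations and tag name -- placeholders, not the real values.
SOURCE_DIR = r"\\fileserver\F1"   # folder fed by the FIX download process
TARGET_DIR = r"\\fileserver\F2"   # folder read by the downstream system
ID_TAG = "TradeID"                # hypothetical uniquely-identifying tag


def trade_id(xml_path: str) -> str | None:
    """Extract the identifying tag instead of hashing the whole file."""
    try:
        root = ET.parse(xml_path).getroot()
    except ET.ParseError:
        return None
    # ".//TAG" finds the tag anywhere in the document (namespaces ignored).
    node = root.find(f".//{ID_TAG}")
    return node.text.strip() if node is not None and node.text else None


def build_daily_zip() -> str:
    """Zip the first occurrence of each trade; skip later duplicates."""
    seen: set[str] = set()  # daily lookup store, discarded after the run
    zip_path = os.path.join(TARGET_DIR, f"trades_{date.today():%Y%m%d}.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for xml_file in sorted(glob.glob(os.path.join(SOURCE_DIR, "*.xml"))):
            key = trade_id(xml_file)
            if key is None or key in seen:
                continue  # unreadable file or duplicate trade
            seen.add(key)
            zf.write(xml_file, arcname=os.path.basename(xml_file))
    return zip_path


if __name__ == "__main__":
    print(f"Wrote {build_daily_zip()}")
```

The `seen` set plays the role of the per-day lookup store; because it only lives for one run, it also satisfies the "wipe clean each day" requirement without any extra storage.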
Where I am at: from everything I have read, I am not sure all of this can be done inside ADF alone. All the cases I have found so far point to having to deploy at least an Azure Function or a Logic App to extract and/or compare the XML content that is relevant to the "duplicate or not" assessment. That is possible but undesirable (extra costs, red tape around approval of such a solution, etc.), so it would be best if we could solve this without deploying anything other than ADF. I also find it odd that ADF can't do this by itself.
As for the storage solution: since it is only about this process and nothing will be reused past each day or in any other process, I can't judge what would be best to use either (a DB? Blob? the FS itself?), but I'd stick with the FS if it does not create any issues.
I'm just getting started with Azure products, hence the overall confusion, but I'll get there.
If every file of yours has fewer than 5000 rows, then you can try the pipeline design below. It requires two pipelines: a parent pipeline and a child pipeline.
You need to use dataset parameters for the datasets that are used inside the ForEach loops or in the child pipeline. Check this SO answer to learn about the usage of dataset parameters in ADF pipelines.
If the files are larger than that, then it's better to use other services like Azure Functions or Logic Apps.
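For completeness, if a function does turn out to be unavoidable (as suggested above for larger files), the same logic could be wrapped in a small HTTP-triggered Azure Function using the Python v2 programming model and called from an ADF Azure Function activity. This is only a hedged sketch: the route name, the request body shape, and the `build_daily_zip(source, target)` helper (a parameterised version of the earlier sketch) are made up for illustration.

```python
import json

import azure.functions as func

# v2 Python programming model: one FunctionApp with decorated triggers.
app = func.FunctionApp()


@app.route(route="dedupe-and-zip", auth_level=func.AuthLevel.FUNCTION)
def dedupe_and_zip(req: func.HttpRequest) -> func.HttpResponse:
    """Called once per day from an ADF Azure Function activity.

    The request body is assumed to carry the folders, e.g.
    {"source": "...\\F1", "target": "...\\F2"} -- an illustrative
    contract, not a real API.
    """
    body = req.get_json()
    # build_daily_zip is the dedup/zip routine sketched earlier,
    # here assumed to accept the source and target folders as arguments.
    zip_path = build_daily_zip(body["source"], body["target"])
    return func.HttpResponse(
        json.dumps({"zip": zip_path}),
        mimetype="application/json",
        status_code=200,
    )
```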