I have a workflow process like:
- Load data from a DB - Extract
- Transform the data based on Business Rules - Transform
- Publish the processed data to an event stream or message queue - Publish
In the extract task, I need to validate or de-duplicate each record. There is a microservice that exposes an API to validate or read data by resource ID.
My questions are:
- Is it good practice to invoke the API once per record to validate a batch, when a batch can contain 40K records?
- Is it good practice to put the records onto an event stream or message queue, have the Validator microservice read them, and have it send the verdict back as a message or event that the extract task then reads? If so, the flow is no longer async, because the extract task has to wait for the verdicts before it can proceed.
- Or should I keep a read replica of the data store or table inside the batch ingestion context, so that the de-dup logic can live in the batch flow itself?
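To make the first option more concrete, here is a rough sketch of a variation on it: calling the validator in chunks instead of once per record. It assumes the microservice could accept several resource IDs in one request, which the current API may not support. The names `validate_many` and `fake_validate_many` and the chunk size of 500 are hypothetical and only there for illustration.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def chunked(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of ids, each holding at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def validate_in_chunks(
    ids: List[str],
    validate_many: Callable[[List[str]], Dict[str, bool]],
    chunk_size: int = 500,
) -> Tuple[Dict[str, bool], int]:
    """Validate ids chunk by chunk; return the verdicts and the number of calls made."""
    verdicts: Dict[str, bool] = {}
    calls = 0
    for chunk in chunked(ids, chunk_size):
        # One round trip per chunk instead of one per record.
        verdicts.update(validate_many(chunk))
        calls += 1
    return verdicts, calls


def fake_validate_many(chunk: List[str]) -> Dict[str, bool]:
    """Hypothetical stand-in for the validator API: even-numbered IDs count as valid."""
    return {rid: int(rid) % 2 == 0 for rid in chunk}


record_ids = [str(n) for n in range(40_000)]
verdicts, calls = validate_in_chunks(record_ids, fake_validate_many, chunk_size=500)
print(calls)  # 80 calls instead of 40,000
```

With a real service, `fake_validate_many` would be replaced by an HTTP call, and the extract task would still block on each chunk, so this reduces the number of round trips but does not make the step asynchronous.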
I know an event may not be the right fit here, since an event is meant to notify that something has happened or changed.
Please share your thoughts.
Thanks in advance.