How can I make a long-running job that fails partway through resume from the last successful operation?
For example, there's a service that has to notify 1 million users via AWS SNS. The service has to send a request to SNS for each user, one by one. If the service dies while trying to notify the 999,999th user, how can I make the restarted service resume with the remaining 2 users instead of starting over?
My idea is to use Redis for idempotency, so that each user is notified exactly once. The whole operation is treated as a single message on a queue.
The processing service will:
- Receive a message to notify users
- Query users that match the criteria of the job.
- For each user, compare the user id with the last user id stored in Redis:
  - If it is less than the id in Redis, skip the user.
  - If it is greater than the id in Redis, send an SNS notification to the user.
  - Update the user id stored in Redis.
- Continue to the next user.
- Once the job is completed, ACK the message.
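
The steps above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not a definitive implementation: `checkpoint` is a plain dict standing in for Redis, `send` is a hypothetical callable standing in for the SNS publish call, and the key name `last_notified_user_id` is made up for illustration. It also assumes user ids are integers processed in ascending order, and that an id equal to the stored one counts as already notified.

```python
from typing import Callable, Iterable, MutableMapping

CHECKPOINT_KEY = "last_notified_user_id"  # hypothetical key name


def run_job(
    user_ids: Iterable[int],
    checkpoint: MutableMapping[str, int],  # stand-in for Redis
    send: Callable[[int], None],           # stand-in for the SNS publish call
) -> None:
    """Notify users in ascending id order, skipping those already done."""
    last_done = checkpoint.get(CHECKPOINT_KEY, 0)
    for user_id in sorted(user_ids):
        if user_id <= last_done:
            continue  # already notified in a previous run
        send(user_id)                         # send the notification
        checkpoint[CHECKPOINT_KEY] = user_id  # record progress
        last_done = user_id
    # The caller would ACK the queue message only after this returns.


# Usage: a previous run got as far as user 3, so only 4 and 5 are sent.
checkpoint = {CHECKPOINT_KEY: 3}
sent = []
run_job(range(1, 6), checkpoint, sent.append)
print(sent)  # [4, 5]
```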
This solution seems to work, but the service could fail after sending a notification to a user and before updating Redis. On restart, that user would then be notified a second time.
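
The failure window described here can be reproduced with a small simulation. This is a sketch under assumptions: a dict subclass stands in for Redis, a list records what would be SNS calls, and `FlakyCheckpoint`, `notify_all`, and the key `"last"` are hypothetical names used only for illustration.

```python
class FlakyCheckpoint(dict):
    """Dict standing in for Redis; fails once when saving a chosen user id."""

    def __init__(self, fail_on: int) -> None:
        super().__init__()
        self.fail_on = fail_on

    def __setitem__(self, key: str, value: int) -> None:
        if value == self.fail_on:
            self.fail_on = -1  # fail only once
            raise ConnectionError("checkpoint write lost")
        super().__setitem__(key, value)


def notify_all(user_ids, checkpoint, sent):
    """Send first, then record progress (the ordering from the question)."""
    last_done = checkpoint.get("last", 0)
    for user_id in user_ids:
        if user_id <= last_done:
            continue
        sent.append(user_id)          # notification goes out first
        checkpoint["last"] = user_id  # then progress is recorded


sent = []
checkpoint = FlakyCheckpoint(fail_on=3)
try:
    notify_all([1, 2, 3, 4], checkpoint, sent)  # first run dies after sending to 3
except ConnectionError:
    pass
notify_all([1, 2, 3, 4], checkpoint, sent)      # restarted run
print(sent)  # [1, 2, 3, 3, 4]
```

User 3 appears twice because the notification was sent but the checkpoint still held 2 when the job restarted. With this ordering the job is at-least-once rather than exactly-once: a restart repeats the one user whose checkpoint write was lost.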