How to design a long-running process that can continue after an outage?


How can I make a long-running job that fails in the middle of processing continue from the last successful operation?

For example, there's a service that has to notify 1 million users via AWS SNS. The service has to send a request to SNS for each user, one by one. If the service dies while trying to notify the 999,999th user, how can I make the restarted service resume with just the last 2 users instead of starting over?

My idea is to use Redis for idempotency, so that each user is notified exactly once. The whole operation is treated as a single message on a queue.

The processing service will:

  1. Receive a message to notify users
  2. Query the users that match the criteria of the job, ordered by user id.
  3. Compare the user id with the last processed user id stored in Redis.
     3.1 If it is less than or equal to the id in Redis, skip the user.
     3.2 If it is greater than the id in Redis, send an SNS notification to the user.
     3.3 Update the user id in Redis.
  4. Continue to the next user.
  5. Once the job is completed, ACK the message.
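
To make the steps above concrete, here is a minimal sketch of the loop. It is only an illustration under assumptions: a plain dict stands in for Redis, the `send` callable stands in for the SNS publish call, user ids are assumed to be positive integers, and the names `run_job` and `CHECKPOINT_KEY` are made up for this example.

```python
from typing import Callable, Dict, Iterable, List

CHECKPOINT_KEY = "notify_job:last_user_id"  # hypothetical key name


def run_job(
    user_ids: Iterable[int],
    checkpoint: Dict[str, int],
    send: Callable[[int], None],
) -> None:
    """Notify users in ascending id order, resuming after the stored checkpoint."""
    last_done = checkpoint.get(CHECKPOINT_KEY, 0)
    for user_id in sorted(user_ids):
        if user_id <= last_done:
            continue  # already handled in a previous run
        send(user_id)  # stand-in for the SNS publish call
        checkpoint[CHECKPOINT_KEY] = user_id  # stand-in for a Redis SET
        last_done = user_id


# Usage: pretend users 1..3 were already notified before a crash.
sent: List[int] = []
store: Dict[str, int] = {CHECKPOINT_KEY: 3}
run_job(range(1, 6), store, sent.append)
print(sent)  # [4, 5]
```

Only after `run_job` returns would the service ACK the queue message, so an un-ACKed message is redelivered and the job resumes from the checkpoint.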

This solution seems to work, but the service could fail after sending a notification for a user and before updating Redis. After a restart, that user would be notified again, so a user can still receive the notification more than once.
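
The failure window can be reproduced with a small simulation. This is a sketch under assumptions, not the real service: a dict stands in for Redis, appending to a list stands in for the SNS call, and `CrashAfterSend`, `process`, and the `crash_at` parameter are hypothetical names used only to inject the failure.

```python
from typing import Dict, List


class CrashAfterSend(Exception):
    """Simulates the process dying after the send but before the checkpoint write."""


def process(
    user_ids: List[int],
    checkpoint: Dict[str, int],
    sent: List[int],
    crash_at: int = -1,
) -> None:
    last_done = checkpoint.get("last_user_id", 0)
    for user_id in user_ids:
        if user_id <= last_done:
            continue  # skipped: already recorded as done
        sent.append(user_id)  # the notification goes out
        if user_id == crash_at:
            raise CrashAfterSend(user_id)  # checkpoint is never updated
        checkpoint["last_user_id"] = user_id
        last_done = user_id


checkpoint: Dict[str, int] = {}
sent: List[int] = []
users = [1, 2, 3, 4]

try:
    process(users, checkpoint, sent, crash_at=3)
except CrashAfterSend:
    pass  # the service restarts and the un-ACKed message is redelivered

process(users, checkpoint, sent)
print(sent)  # [1, 2, 3, 3, 4]
```

In this simulation user 3 is notified twice, while everyone else is notified once: the duplicate is limited to the single user that was in flight when the crash happened.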
