As part of a recent DR exercise, we simulated an Availability Zone (AZ) becoming unavailable. During the exercise, ECS kept trying to start tasks in the "failed/unavailable" AZ.
Is it possible to prevent this situation from happening?
An idea was proposed to use a parallel process to update the ECS tasks with a placementConstraint directive that excluded the unavailable AZ. However, relying on an active process during a disaster seems like a recipe for, well, disaster.
Is it possible to use a static placement constraint that is in place before the disaster event? In other words, is it possible to say "if an AZ is unavailable, then don't try to start tasks in that AZ"?
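For illustration, the constraint that such a parallel process would apply might look like the sketch below. It assumes the EC2 launch type (task placement constraints are not supported on Fargate), uses `us-east-1a` purely as a placeholder AZ name, and the helper `exclude_az_constraint` is a hypothetical name, not part of any AWS SDK.

```python
def exclude_az_constraint(az_name: str) -> dict:
    """Build a memberOf placement constraint that keeps tasks out of one AZ.

    The expression uses the ECS cluster query language attribute
    `ecs.availability-zone`; az_name is a placeholder supplied by the caller.
    """
    return {
        "type": "memberOf",
        "expression": f"attribute:ecs.availability-zone != {az_name}",
    }


# Assumption: "us-east-1a" stands in for the AZ that was taken offline.
constraints = [exclude_az_constraint("us-east-1a")]
print(constraints)
```

A list like this matches the shape of the `placementConstraints` parameter on an ECS service. Note that it excludes the AZ unconditionally, so on its own it does not express "only while the AZ is unavailable", which is the behavior being asked about.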
Thank you
Thanks for sharing this scenario that you and your team were looking into! This is a new feature that we are exploring for our customers and we will share more details as we can.
It would be great to know a little more about how you performed your disaster recovery simulation. In particular, I'd be curious to hear some high-level details on how you simulated an AZ becoming unavailable.
Thanks,
The AWS ECS team