I want to replace an index with zero downtime, as described in the ES documentation.
I am doing so by:

- creating a new index my_index_v2 with the new data
- refreshing the new index
- then swapping the two in a single atomic operation, by performing the following request:

```
POST /_aliases
{
  "actions": [
    { "remove": { "index": "*", "alias": "my_index" }},
    { "add": { "index": "my_index_v2", "alias": "my_index" }}
  ]
}
```
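For reference, the whole cycle can be sketched end to end with the official Python client (a minimal sketch, not our production code; elasticsearch-py 7.x and a local cluster are assumed, and the bulk-loading of the new data is omitted):

```python
from elasticsearch import Elasticsearch

# Assumed: a local cluster and the elasticsearch-py 7.x client.
es = Elasticsearch("http://localhost:9200")

# 1. Create the new index and load the new data into it.
es.indices.create(index="my_index_v2")
# ... bulk-index the new data into my_index_v2 here ...

# 2. Refresh, so the new data is searchable before the swap.
es.indices.refresh(index="my_index_v2")

# 3. Swap the alias in one atomic cluster-state update: searches
#    going through the alias never see a moment without a backing index.
es.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "*", "alias": "my_index"}},
        {"add": {"index": "my_index_v2", "alias": "my_index"}},
    ]
})
```

Note that the remove action uses "index": "*", so it strips the alias from whichever index currently holds it, which keeps the request identical across runs.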
This works as expected, except that it randomly fails with a 404 response. The error body is:
```
{
  "error": {
    "root_cause": ... (same),
    "type": "index_not_found_exception",
    "reason": "no such index",
    "resource.type": "index_or_alias",
    "resource.id": "my_unrelated_index_v13",
    "index": "my_unrelated_index_v13"
  },
  "status": 404
}
```
- Afterwards, and only if the swap worked, we delete the now-unused indices that were associated with this alias and no other.
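A minimal sketch of this cleanup step, assuming the old indices are identified by resolving the alias just before the swap (how the "this alias only" set is determined is my assumption, not something pinned down above):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Resolve the alias *before* the swap; get_alias returns a dict
# keyed by the names of the indices currently behind the alias.
old_indices = set(es.indices.get_alias(name="my_index"))

# ... create my_index_v2, refresh it, and swap the alias as above ...

# This point is reached only if update_aliases did not raise,
# i.e. only if the atomic swap worked.
for index in old_indices - {"my_index_v2"}:
    es.indices.delete(index=index)
```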
The whole operation runs periodically, every few minutes. Similar operations to the one described may run in the cluster at the same time, on other aliases/indices. The error happens randomly, every several hours.
Is there a reason why these operations would interfere with each other? What is going on?
EDIT: clarified the DELETE step at the end.
This is difficult to reproduce in a local environment because it seems to happen only in highly concurrent scenarios. However, as pointed out by @Eirini Graonidou in the comments, this really looks like an ES bug, fixed in PR 23153.
From the pull request (emphasis mine):
This does not explain the 404 itself, but it definitely explains why the error message does not make sense.
More importantly: upgrading Elasticsearch solves this issue.
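If it helps anyone: you can confirm which version the cluster actually runs before and after the upgrade (a minimal sketch, again with the Python client; GET / reports the server version):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# GET / returns cluster metadata, including the server version,
# so you can verify you are on a release that contains the fix.
print(es.info()["version"]["number"])
```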